217 | 🦹 🕵️ What is AI stealing from you?
Brainyacts #217
It's Tuesday. Let's keep this simple.
Ok, time to dig in.
In today's Brainyacts:
Web scraping and AI models: 101
Opting out of socials training AI on your posts
News publishers get paid by AI and other AI model news
Two AI deepfakes attempted against big companies and more news you can use
Hello to all subscribers!
To read previous editions, click here.
Lead Memo
Why Are People Mad at Anthropic?
People are pretty upset with Anthropic right now. Its ClaudeBot web crawler went overboard and hit iFixit's website nearly a million times in just one day. Imagine getting a million knocks on your door in 24 hours! This overzealous scraping not only violated the site's Terms of Use but also clogged up iFixit's servers, making life tough for its team.
iFixitās CEO, Kyle Wiens, took to social media to call out Anthropic.
He also shared screenshots showing that even Anthropic's chatbot admitted iFixit's content was off-limits. Kyle wasn't just mad about the stolen content; he was also frustrated because this scraping spree tied up iFixit's resources. He even invited Anthropic to discuss licensing the content properly, but it seems they haven't had that chat yet.
What is Web Scraping?
Alright, let's break down web scraping. Web scraping is like sending little digital bots to browse websites and gather information. These bots collect data such as text, images, and links from web pages. Think of it as copying and pasting information, but doing it on a massive scale and super fast.
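To make that concrete, here is a minimal sketch of the "collect text and links" step, using only Python's standard library. The sample page, the `PageScraper` class, and the `scrape` helper are all made up for illustration; a real scraper would also download pages over the network and handle far messier HTML.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects text fragments and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every <a> tag we pass.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-empty text fragments.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def scrape(html):
    """Return the text and links found in an HTML string."""
    parser = PageScraper()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "links": parser.links}


# Hypothetical page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <h1>How to replace a phone battery</h1>
  <p>Step 1: power off the phone.</p>
  <a href="/guides/battery">Full guide</a>
  <a href="https://example.com/tools">Tools</a>
</body></html>
"""

result = scrape(SAMPLE_PAGE)
print(result["text"])
print(result["links"])
```

Run that across millions of pages instead of one and you have the basic idea behind a training-data crawler.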
How AI Companies Use Web Scraping
AI companies are really into web scraping because they need loads of data to train their models. For instance, if you're building an AI that can chat like a human, you need tons of examples of human conversations. That's where scraping comes in handy: it helps gather diverse and vast amounts of data from across the internet.
Why AI Companies Use Web Scraping
Data Collection: AI models thrive on data. The more data they have, the smarter they get. Web scraping is an efficient way to gather this data.
Improving AI Models: With more data, AI models can learn better and perform tasks more accurately.
Staying Updated: Web scraping helps keep AI models up-to-date with the latest information.
The Robots.txt Process to Control Scrapers
Companies do have control over what can be scraped and what cannot (sort of). Let's talk about robots.txt. This is a file that website owners use to tell web crawlers what they can and cannot do on their site. It's like a "Do Not Disturb" sign for certain parts of a website.
Creating the File: Website owners put a robots.txt file in their site's root directory.
Setting Rules: This file has rules that tell specific crawlers (user agents) whether they're allowed to access certain parts of the site.
Crawler Check: When a crawler visits, it's supposed to check this file first and follow the rules.
Following Orders: Good crawlers respect these rules, but not all do.
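Here is a rough sketch of what those four steps look like from a well-behaved crawler's side, using Python's built-in `urllib.robotparser`. The robots.txt content, the `/private/` path, the `example.com` URLs, and the `SomeOtherBot` name are invented for the example; only the `ClaudeBot` user agent comes from the story above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the rules and paths are examples only.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# ClaudeBot is shut out of the whole site.
print(is_allowed(ROBOTS_TXT, "ClaudeBot", "https://example.com/guides/"))
# Everyone else may read the guides, but not the /private/ area.
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/guides/"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/x"))
```

Notice that nothing here is enforced by the website. The crawler runs the check on itself, which is exactly why robots.txt only works with crawlers that choose to play along.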
The Anthropic and iFixit Drama
So, back to the drama. Anthropic's ClaudeBot didn't play by the rules. Despite iFixit having a robots.txt file, ClaudeBot ignored it and went on a scraping spree. This caused a major headache for iFixit, who had to add a crawl-delay extension to their robots.txt file to finally get ClaudeBot to back off.
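For a sense of what a crawl delay does, here is a small sketch. `Crawl-delay` is a non-standard robots.txt extension that some crawlers honor, and Python's `urllib.robotparser` can read it. The 10-second value, the robots.txt content, and the helper names are assumptions for illustration, not iFixit's actual settings.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a crawl delay; the 10 seconds is an example.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

SECONDS_PER_DAY = 24 * 60 * 60


def polite_delay(robots_txt, user_agent, default_seconds=1.0):
    """Return the requested pause between requests for this crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


def max_requests_per_day(delay_seconds):
    """Upper bound on daily requests if the crawler waits between each one."""
    return int(SECONDS_PER_DAY / delay_seconds)


def crawl(urls, fetch, delay_seconds, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


delay = polite_delay(ROBOTS_TXT, "ClaudeBot")
print(delay)                                 # requested pause in seconds
print(max_requests_per_day(delay))           # ceiling with that pause
print(round(1_000_000 / SECONDS_PER_DAY, 1)) # requests per second at a million a day
```

The arithmetic is the interesting part: a million hits in a day works out to roughly 11 to 12 requests every second, while a crawler that honors a 10-second delay tops out at 8,640 requests a day.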
And it's not just iFixit. Other sites like Read the Docs and Freelancer.com have also felt the wrath of aggressive web scrapers. The whole robots.txt thing is a bit of a mess right now.
AI companies, like Anthropic, keep deploying new scrapers. So, website owners often end up blocking old, inactive bots while the new ones slip through because they're constantly changing.
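The "new ones slip through" problem is easy to see in a sketch. The code below builds block rules from a list of crawler names and then checks who is actually blocked. `ExampleBot-Old` and `ExampleBot-New` are hypothetical names standing in for a retired crawler and its replacement, and `build_block_rules` is an illustrative helper, not part of any real tool.

```python
from urllib.robotparser import RobotFileParser


def build_block_rules(user_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = []
    for agent in sorted(set(user_agents)):
        blocks.append(f"User-agent: {agent}\nDisallow: /\n")
    return "\n".join(blocks)


def is_blocked(robots_txt, user_agent, url="https://example.com/"):
    """True if these rules tell the given crawler to stay away from the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical blocklist: one real name from the story, one made-up old bot.
BLOCKLIST = ["ClaudeBot", "ExampleBot-Old"]

rules = build_block_rules(BLOCKLIST)
print(rules)
print(is_blocked(rules, "ExampleBot-Old"))  # the retired bot is blocked
print(is_blocked(rules, "ExampleBot-New"))  # the renamed bot is not
```

Because rules are matched by name, a blocklist is only as good as its latest update. A crawler that shows up under a new user agent is allowed by default until someone adds it.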
The Ongoing Battle
Tools like Dark Visitors are popping up to help website owners keep their robots.txt files up-to-date. But it's still a confusing and never-ending battle. The web scraping landscape is constantly shifting, making it tough for website owners to keep up and protect their content.
In the end, people are frustrated. They want AI companies to play fair, respect their rules, and maybe even talk about paying for the content they're using. Until then, the tug-of-war between web scrapers and website owners will continue.
Spotlight
How to stop social media from using your posts to train their AI
Some of you care. Others may not. Regardless, you should know that most social media companies are using your posts and content to train their AI models. Shocking! Not.
Well, depending on the service, you might be able to stop them. Here are the ways to do it for X/Twitter and Meta/Facebook.
X/Twitter
X/Twitter has automatically activated a setting that allows it to train its Grok AI on users' posts. This new setting is on by default. Here's how to turn it off:
Open up the Settings page on X on your desktop.
Select the "Privacy and safety" button.
Select "Grok."
Uncheck the box.
Meta/Facebook
For Users in the US and Countries Without National Data Privacy Laws
No Direct Opt-Out: Meta doesn't offer a specific opt-out feature for AI training.
Protect Your Privacy: But you can set your account to private to reduce the chance of your data being used.
Messaging Privacy: Meta claims it doesn't use private messages for AI training.
In-Platform Tools: You can delete personal information from chats with Meta AI using built-in tools.
For Users in the EU and UK
You have the right to object to your data being used for AI training. But believe it or not, you have to fill out a form!
Here's how to access the form:
Log in to Facebook
Go to: Settings and Privacy > Privacy Center
Find: "How Meta uses information for generative AI models and features"
Click: "Right to object"
Submit Your Request:
Fill in the form
For the explanation, simply state: "I wish to exercise my right under data protection law to object to my personal data being processed."
Confirm your email address if required
Await Confirmation:
You should receive an email and Facebook notification about your request status.
This usually happens within minutes.
🚨🚨 Remember: While these steps can help, there's no guarantee that your data hasn't already been used for AI training.
AI Model Notables
► Meta to pay $1.4 billion to settle Texas facial recognition data lawsuit.
► Meta launches AI Studio, where anyone can create AI characters at ai.meta.com/ai-studio or in the Instagram app.
► Amazon's new AI chips can deliver performance 40% to 50% higher than NVIDIA's, and they are supposed to cost about half as much as comparable NVIDIA chips.
► The "quasi-merger" approach of Big Tech acquiring AI resources accelerates, as Google parent Alphabet's partnership with AI firm Anthropic comes under investigation in the UK.
► Perplexity is going to start paying news publishers when the platform provides answers to user questions that include the publishers' content. Perplexity was recently called out by Fortune magazine for stealing its content.
News You Can Use:
Ferrari exec foils deepfake plot by asking a question only the CEO could answer.
A well-regarded security training company is the latest to fall victim to a long-running North Korean IT worker scam that uses AI in several ways to fool companies into hiring them, both to make money for the country and to gain access to company IP and confidential information.
Nick Mason wants to use AI to create new Pink Floyd songs.
Google is using AI in an attempt to improve traffic flow in Boston.
With over 8000 power-hungry data centers globally, most in the US, can energy grids keep up with AI's unquenchable thirst for more power?
How to rank your website on ChatGPT (SearchGPT) as it competes with Google Search. In an earlier edition of Brainyacts, I predicted that these AI models would be the new search interface. I am sorry to see that I was right, as this will likely spoil the responses we get in many of the models, or we will have to pay a premium for ad-free access.
A new study finds that AI companions can actually address loneliness. And in a case of perfect timing, here is the latest AI wearable designed specifically to be your friend.
Who is the author, Josh Kubicki?
Some of you know me. Others do not. Here is a short intro. I am a lawyer, entrepreneur, and teacher. I have transformed legal practices and built multi-million dollar businesses. Not a theorist, I am an applied researcher and former Chief Strategy Officer, recognized by Fast Company and Bloomberg Law for my unique work. Through this newsletter, I offer you pragmatic insights into leveraging AI to inform and improve your daily life in legal services.
DISCLAIMER: None of this is legal advice. This newsletter is strictly educational and is not legal advice or a solicitation to buy or sell any assets or to make any legal decisions. Please be careful and do your own research.