217 | 🦹 🕵️ What is AI stealing from you?

It’s Tuesday. Let’s keep this simple.

Ok, time to dig in.

In today’s Brainyacts:

Web scraping and AI models: 101
Opting out of socials training AI on your posts
News publishers get paid by AI and other AI model news
Two AI deepfakes attempted against big companies and more news you can use
👋 to all subscribers!

Lead Memo

😡 😈 Why Are People Mad at Anthropic?

People are pretty upset with Anthropic right now. Its ClaudeBot web crawler went a bit overboard and hit iFixit’s website nearly a million times in just one day. Imagine getting a million knocks on your door in 24 hours! This overzealous scraping not only violated the site’s Terms of Use but also clogged up iFixit’s servers, making life tough for its team.

iFixit’s CEO, Kyle Wiens, took to social media to call out Anthropic.

He also shared screenshots showing that even Anthropic’s chatbot admitted iFixit’s content was off-limits. Kyle wasn’t just mad about the stolen content; he was also frustrated because the scraping spree tied up iFixit’s resources. He even invited Anthropic to discuss licensing the content properly, but it seems they haven’t had that chat yet.

What is Web Scraping?

Alright, let’s break down web scraping. Web scraping is like sending little digital bots to browse websites and gather information. These bots collect data such as text, images, and links from web pages. Think of it as copying and pasting information but doing it on a massive scale and super-fast.
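To make that concrete, here is a minimal sketch of what a scraper does with a single page, using only Python’s standard library. The sample page, the `PageScraper` class, and the `scrape` helper are made up for illustration; a real scraper would also download pages over the network and repeat this across many URLs.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects visible text, link targets, and image sources from one page."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Keep only non-empty text fragments.
        if data.strip():
            self.text.append(data.strip())


# Hypothetical page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <h1>Repair guide</h1>
  <p>Open the case. <a href="/guides/battery">Replace the battery</a>.</p>
  <img src="/img/step1.png" alt="Step 1">
</body></html>
"""


def scrape(html):
    """Return the text, links, and images found in an HTML string."""
    scraper = PageScraper()
    scraper.feed(html)
    scraper.close()
    return {"text": scraper.text, "links": scraper.links, "images": scraper.images}


result = scrape(SAMPLE_PAGE)
print(result["links"])
print(result["images"])
```

Multiply this by millions of pages and you have the “massive scale” part.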

How AI Companies Use Web Scraping

AI companies are really into web scraping because they need loads of data to train their models. For instance, if you’re building an AI that can chat like a human, you need tons of examples of human conversations. That’s where scraping comes in handy—it helps gather diverse and vast amounts of data from across the internet.

Why AI Companies Use Web Scraping

Data Collection: AI models thrive on data. The more data they have, the smarter they get. Web scraping is an efficient way to gather this data.
Improving AI Models: With more data, AI models can learn better and perform tasks more accurately.
Staying Updated: Web scraping helps keep AI models up-to-date with the latest information.

The Robots.txt Process to Control Scrapers

Companies do have control over what can be scraped and what cannot (sort of). Let’s talk about robots.txt. This is a file that website owners use to tell web crawlers what they can and cannot do on their site. It’s like a “Do Not Disturb” sign for certain parts of a website.

Creating the File: Website owners put a robots.txt file in their site’s root directory.
Setting Rules: This file has rules that tell specific crawlers (user agents) whether they’re allowed to access certain parts of the site.
Crawler Check: When a crawler visits, it’s supposed to check this file first and follow the rules.
Following Orders: Good crawlers respect these rules, but not all do.
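Here is a rough sketch of how a well-behaved crawler might run that check, using Python’s built-in `urllib.robotparser`. The robots.txt rules, the `SomeOtherBot` name, and the example.com URLs are invented for illustration; they are not any real site’s rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block ClaudeBot everywhere,
# and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text (already downloaded) into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
claudebot_ok = parser.can_fetch("ClaudeBot", "https://example.com/guides/")
other_ok = parser.can_fetch("SomeOtherBot", "https://example.com/guides/")
other_private_ok = parser.can_fetch("SomeOtherBot", "https://example.com/private/notes")

print(claudebot_ok, other_ok, other_private_ok)
```

Nothing in the file enforces anything. The crawler has to choose to run a check like this and honor the answer, which is why the “sort of” above matters.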

The Anthropic and iFixit Drama

So, back to the drama. Anthropic’s ClaudeBot didn’t play by the rules. Despite iFixit having a robots.txt file, ClaudeBot ignored it and went on a scraping spree. This caused a major headache for iFixit, which had to add a crawl-delay extension to its robots.txt file to finally get ClaudeBot to back off.
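For the curious, a crawl-delay rule asks a crawler to wait a set number of seconds between requests. The sketch below shows how a crawler could read that value with Python’s `urllib.robotparser` and space out its visits. The 10-second delay, the URLs, and the `schedule` helper are assumptions for illustration; the value iFixit actually used is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ClaudeBot may crawl, but should pause between requests.
ROBOTS_WITH_DELAY = """\
User-agent: ClaudeBot
Crawl-delay: 10
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_WITH_DELAY.splitlines())

# Seconds this crawler is asked to wait between requests (None if unset).
delay = parser.crawl_delay("ClaudeBot")


def schedule(urls, delay_seconds):
    """Return (seconds_from_start, url) pairs spaced by the requested delay."""
    wait = delay_seconds or 0
    return [(i * wait, url) for i, url in enumerate(urls)]


plan = schedule(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    delay,
)
for offset, url in plan:
    print(f"t+{offset}s  {url}")
```

At one request every 10 seconds, a crawler tops out at 8,640 requests a day, a long way from nearly a million.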

And it’s not just iFixit. Other sites like Read the Docs and Freelancer.com have also felt the wrath of aggressive web scrapers. The whole robots.txt thing is a bit of a mess right now.

AI companies, like Anthropic, keep deploying new scrapers. So, website owners often end up blocking old, inactive bots while the new ones slip through because they’re constantly changing.
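A small sketch of that gap: the check below takes a robots.txt file and a list of crawler names, then reports which crawlers are still allowed in. `ExampleOldBot` and `ExampleNewBot` are placeholder names, not real crawlers, and the helper functions are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that only blocks an older crawler name.
ROBOTS_TXT = """\
User-agent: ExampleOldBot
Disallow: /
"""

# Placeholder list of crawler names a site owner wants to keep out.
CRAWLERS_TO_BLOCK = ["ExampleOldBot", "ExampleNewBot"]


def unblocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that robots.txt still allows to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


def disallow_block(agents):
    """Build robots.txt entries that block each named agent site-wide."""
    return "\n\n".join(f"User-agent: {a}\nDisallow: /" for a in agents) + "\n"


missing = unblocked_agents(ROBOTS_TXT, CRAWLERS_TO_BLOCK)
print("Still allowed:", missing)

# Append rules for the crawlers that slipped through, then re-check.
updated = ROBOTS_TXT + "\n" + disallow_block(missing)
still_missing = unblocked_agents(updated, CRAWLERS_TO_BLOCK)
print("Still allowed after update:", still_missing)
```

The catch is that the list of names has to be kept current by hand, which is exactly where website owners fall behind.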

The Ongoing Battle

Tools like Dark Visitors are popping up to help website owners keep their robots.txt files up-to-date. But it’s still a confusing and never-ending battle. The web scraping landscape is constantly shifting, making it tough for website owners to keep up and protect their content.

In the end, people are frustrated. They want AI companies to play fair, respect their rules, and maybe even talk about paying for the content they’re using. Until then, the tug-of-war between web scrapers and website owners will continue.

Spotlight

🛑 🔒 How to stop social media from using your posts to train their AI

Some of you care. Others may not. Regardless, you should know that most social media companies are using your posts and content to train their AI models. Shocking! Not.

Well, depending on the service, you might be able to stop them. Here are the ways to do it for X/Twitter and Meta/Facebook.

X/Twitter

X/Twitter has automatically activated a new setting that allows it to train its Grok AI on users’ posts, and it is on by default. Here’s how to turn it off:

Open up the Settings page on X on your desktop.
Select the “Privacy and safety” button.
Select “Grok.”
Uncheck the box.

Meta/Facebook

For Users in the US and Countries Without National Data Privacy Laws

No Direct Opt-Out: Meta doesn't offer a specific opt-out feature for AI training.
Protect Your Privacy: But you can set your account to private to reduce the chance of your data being used.
Messaging Privacy: Meta claims it doesn't use private messages for AI training.
In-Platform Tools: You can delete personal information from chats with Meta AI using built-in tools.

For Users in the EU and UK

You have the right to object to your data being used for AI training. But believe it or not, you have to fill out a form!

Here's how to access the form:

Log in to Facebook
Go to: Settings and Privacy > Privacy Center
Find: "How Meta uses information for generative AI models and features"
Click: "Right to object"
Submit Your Request:
Fill in the form
For the explanation, simply state: "I wish to exercise my right under data protection law to object to my personal data being processed."
Confirm your email address if required
Await Confirmation: You should receive an email and a Facebook notification about your request status. This usually happens within minutes.

🚨🚨 Remember: While these steps can help, there's no guarantee that your data hasn't already been used for AI training.

AI Model Notables

► Meta to pay $1.4 billion to settle Texas facial recognition data lawsuit.

► Meta launches AI Studio, where anyone can create AI characters at ai.meta.com/ai-studio or in the Instagram app.

► Amazon’s new AI chips can reportedly deliver 40% to 50% higher performance than NVIDIA’s, at roughly half the cost of comparable NVIDIA chips.

► The “quasi-merger” approach of BigTech getting AI resources accelerates as Google-parent Alphabet’s partnership with AI firm Anthropic comes under investigation in the UK.

► Perplexity is going to start paying news publishers when the platform’s answers to user questions include those publishers’ content. Perplexity was recently called out by Forbes for stealing its content.

News You Can Use:

➭ Ferrari exec foils deepfake plot by asking a question only the CEO could answer.

➭ A well-regarded security training company is the latest to fall victim to a long-running North Korean IT worker scam that uses AI in several ways to fool companies into hiring them – both to make money for the country and to gain access to company IP and confidential information.

➭ Nick Mason wants to use AI to create new Pink Floyd songs.

➭ Google is using AI in an attempt to improve traffic flow in Boston.

➭ With over 8,000 power-hungry data centers globally, most in the US, can energy grids keep up with AI’s unquenchable thirst for more power?

➭ How to rank your website on ChatGPT (SearchGPT) as it competes with Google Search. In an earlier edition of Brainyacts, I predicted that these AI models would be the new search interface. I am sorry to see that I was right, as this will likely spoil the responses we get from many of the models, or we will have to pay a premium for ad-free access.

➭ A new study finds that AI companions can actually address loneliness. And in a case of perfect timing, here is the latest AI wearable designed specifically to be your friend.

Who is the author, Josh Kubicki?

Some of you know me. Others do not. Here is a short intro. I am a lawyer, entrepreneur, and teacher. I have transformed legal practices and built multi-million dollar businesses. Not a theorist, I am an applied researcher and former Chief Strategy Officer, recognized by Fast Company and Bloomberg Law for my unique work. Through this newsletter, I offer you pragmatic insights into leveraging AI to inform and improve your daily life in legal services.

DISCLAIMER: None of this is legal advice. This newsletter is strictly educational and is not legal advice or a solicitation to buy or sell any assets or to make any legal decisions. Please be careful and do your own research.