The Brainyacts
224 | 📚🥷 Where to find pirated books to train AI

Brainyacts #224

August 23, 2024

It’s Friday. Boston Dynamics (the maker of creepy animal robots) shows how its AI humanoid robot Atlas “warms up.”

Let’s dig in.

In today’s Brainyacts:

Anthropic sued for using The Pile - a dataset that includes pirated books
Harvey releases user data
Elon releases Grok 2 (and it is powerful!) and other AI model news
Deepfake Biden fine is $1 million and more news you can use
👋 to all subscribers!

To read previous editions, click here.

Lead Memo

📚🥷 How AI companies use pirated books to train their models.

Anthropic got sued by authors for training Claude using works that were included in a dataset of pirated books. You can read the Complaint HERE.

But how does this happen, you might ask? Did Anthropic buy digital versions of the books? Were the books freely and legally accessible on the internet? Let’s take a moment to look at what is happening here and the basis of the claims in this lawsuit.

How Published Books Get Pirated

Pirating published books typically involves illegally copying and distributing digital versions of those books without the permission of the copyright holders. These pirated copies often find their way onto websites that allow users to download them for free. Sites like “Library Genesis” (LibGen) and “Bibliotik” are notorious for hosting vast collections of such pirated books, making them available to anyone with an internet connection. These collections are often referred to as “pirate libraries” and are distributed through file-sharing networks like BitTorrent.

What Is “The Pile”?

“The Pile” is an 800 GB+ open-source dataset designed specifically for training large language models, like those used in artificial intelligence (AI). This dataset was created by a nonprofit organization called EleutherAI, and it consists of 22 different subsets of text from various sources, including academic papers, websites, and, controversially, pirated books. One of the key components of The Pile is a subset called “Books3,” which contained a massive collection of pirated books. It has since been taken down.

How AI Companies Likely Use “The Pile” for Training

AI companies use datasets like The Pile to train their language models. These models learn by analyzing vast amounts of text data, which helps them understand and generate human-like language. Much of the data and content is fine to use. However, the use of The Pile, particularly the Books3 subset, raises significant ethical and legal concerns because it includes pirated books—works that are protected by copyright law.

In the case of Anthropic, it’s been revealed that they used The Pile, including Books3, to train their language model named Claude. This means that they likely trained their AI on a large amount of copyrighted material without obtaining proper licenses or permissions from the authors or publishers of those works. This has led to accusations that Anthropic, and possibly other AI companies, have relied on stolen content to develop their models, rather than legally sourcing the material.

Spotlight

👏 🤔 Did Harvey help or hurt itself?

Most of you have likely heard of Harvey.ai. It has emerged as a significant player but has come under fire for not being transparent in some ways. Well, it seems they are trying to address this.

They just released some user data, showcasing impressive utilization and retention metrics.

This is undoubtedly a move designed to bolster confidence in its platform. However, for those of us who prioritize pragmatic, data-driven insights, the release prompts as many questions as it answers.

Harvey’s reported figures are, on the surface, commendable. The platform claims that utilization rates have more than doubled over the past year, with retention rates consistently hovering around 70% after one year of use.

Moreover, the data highlights successful integration within some of the largest law firms, with utilization rates in some cases exceeding 100%. These numbers suggest a platform that is not only gaining traction but is also becoming increasingly indispensable to its users.

Yet, it’s clear that the information provided, while promising, lacks the necessary context to be truly meaningful. Metrics such as “utilization” and “retention” are presented without clear definitions or benchmarks for comparison. We are told of “consistent and significant increases” and “exceptional retention,” but without a baseline or understanding of how these terms are defined, it’s difficult to gauge the true impact or relevance of these statistics.

For instance, what does Harvey mean by “utilization”? Is it measured by the frequency of logins, the volume of queries, or the depth of integration into workflows? Similarly, how is “retention” calculated? Without these clarifications, the data risks being more of a marketing tool than a genuine insight into the platform’s effectiveness.

Furthermore, the case studies of three BigLaw firms are intriguing yet raise additional questions.

The notion of “over-provisioning,” where utilization surpasses 100%, might be seen as a success, but it could also suggest a potential issue with resource allocation or a metric that does not adequately reflect actual use. These are the kinds of details that, if clarified, could offer deeper insights into how Harvey is truly being adopted and used across firms of varying sizes and specialties.
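To make the definitional problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `utilization` function, the idea that utilization means active users divided by licensed seats, and the firm's numbers are all hypothetical and are not taken from Harvey's release. It only shows how the same firm can report very different utilization figures depending on what counts as an "active" user, and how a figure above 100% can arise when more people use the tool than there are seats.

```python
def utilization(active_users: int, licensed_seats: int) -> float:
    """Return utilization as a percentage of licensed seats.

    This is one possible definition, assumed for illustration only.
    It is not Harvey's published formula.
    """
    if licensed_seats <= 0:
        raise ValueError("licensed_seats must be positive")
    return 100.0 * active_users / licensed_seats


# Hypothetical firm: every number below is made up for the example.
licensed_seats = 500
users_with_any_login = 600   # logged in at least once this month
users_with_ten_queries = 200  # ran ten or more queries this month

# Same firm, same month, two definitions of "active".
loose = utilization(users_with_any_login, licensed_seats)
strict = utilization(users_with_ten_queries, licensed_seats)

print(f"Any login counts as active:    {loose:.0f}%")
print(f"Ten or more queries is active: {strict:.0f}%")
```

Under the loose definition this hypothetical firm sits above 100%, while under the strict one it sits well below half. Without knowing which kind of definition a vendor uses, the headline number cannot be compared across firms or products.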

In a sector where credibility and precision are paramount, especially when it comes to adopting new technologies like generative AI, such ambiguities can undermine the very trust that companies like Harvey seek to build. Legal professionals need more than just numbers and graphs—they need context, clear definitions, and transparency about both successes and challenges.

Harvey’s data release is a step in the right direction, signaling a commitment to transparency. However, for it to be genuinely useful to the legal community, it must go beyond the surface. By providing greater detail and clarity, Harvey can better equip firms to make informed decisions about integrating generative AI into their practices.

AI Model Notables

► Midjourney, the text-to-image generator that previously lived only on Discord and thus confused many potential users, has finally opened its website to all users, offering 25 free AI image generations.

► Soon you will be able to give your AI assistant some money and it will go buy things for you.

► xAI/Twitter has begun rolling out early beta access for Grok 2, a powerful new AI model that leverages real-time data from X and uses Flux.1 to generate relatively unfiltered AI images.

► Google DeepMind staff call for end to military contracts.

► Perplexity AI plans to start running ads in fourth quarter as AI-assisted search gains popularity.

► A new web crawler launched by Meta last month is quietly scraping the internet for AI training data.

► Microsoft’s controversial “Recall” feature needs more time for security testing. Remember, this is a feature that basically reads and tracks everything you read and do on your computer.

News You Can Use:

➭ Telecom will pay $1 million over deepfake Joe Biden robocall in settlement with FCC.

➭ The rise and fall of America’s AI mayoral candidate—and OpenAI’s mad dash to shut it down.

➭ McAfee introduces AI deepfake detection software for PCs.

➭ Another lawyer gets sanctioned for using fake case cites from ChatGPT.

➭ These are the jobs that AI is replacing: Amazon software engineers may be forced to learn skills besides coding thanks to AI.

➭ Uber teams up with Cruise to deliver more autonomous rides next year.

Who is the author, Josh Kubicki?

Some of you know me. Others do not. Here is a short intro. I am a lawyer, entrepreneur, and teacher. I have transformed legal practices and built multi-million dollar businesses. Not a theorist, I am an applied researcher and former Chief Strategy Officer, recognized by Fast Company and Bloomberg Law for my unique work. Through this newsletter, I offer you pragmatic insights into leveraging AI to inform and improve your daily life in legal services.

DISCLAIMER: None of this is legal advice. This newsletter is strictly educational and is not legal advice or a solicitation to buy or sell any assets or to make any legal decisions. Please be careful and do your own research.