106 | 🕸️ 💰 Web Scraping

Brainyacts #106

In today’s Brainyacts we:

  1. dig into web scraping and why you should know more about it

  2. share a transformative Generative AI experience

  3. get 1 billion tokens!

  4. see how many times “generative AI” shows up in earnings calls

  5. cover how cheating is getting harder

  6. ponder if AI should give and take military orders

  7. see China lose and win in the realm of AI

  8. deep fake a Trump/Biden debate (ugh)

  9. get another OpenAI lawsuit (join the party!)

  10. share a selfie from Hamburg!

👋 to new subscribers!

To read previous posts, click here.

🕸️ 💰 Web Scraping: What You Need to Know

Recent headlines are bringing the practice of ‘web scraping’ to the forefront.

For instance, OpenAI now finds itself in a legal quagmire over the industry-wide practice of web scraping – a method to gather vast amounts of data from the internet for training advanced AI programs like ChatGPT and DALL-E. Recent lawsuits, including a significant class action filed last week against both OpenAI and its investor Microsoft Corp., allege that the company has infringed upon various privacy, intellectual property, and anti-hacking laws by scraping the personal data of countless internet users.

tl;dr

  1. Web scraping, the practice of extracting large amounts of data from the internet, is crucial for the operation of AI models such as ChatGPT, Claude, Bard, and LLaMA, which use this data for training purposes.

  2. Recent concerns and legal action have been raised against data scraping practices in AI training, highlighting potential copyright and privacy issues. OpenAI, for example, has been sued over allegations of unlawfully copying text from books without consent and violating privacy laws through its data collection practices.

  3. Tech platforms such as Twitter have also started taking steps to protect their data from AI scraping, by implementing restrictions and rate limits on data access.

  4. Despite the backlash, Google confirmed that it continues to scrape data for AI training, even updating its privacy policy to reflect this.

  5. Experts believe that the growing debate around data scraping is inevitable given the rise of large language models. This conversation is intensified by actions taken by big tech companies like Google and Twitter.

  6. Many companies view their data as a competitive advantage and are seeking to restrict access to it and explore monetization opportunities.

  7. Using personal data in AI models presents unique privacy issues, such as a lack of transparency about how the data is used and potential harms that could arise from its use. Deleting or removing data from trained models can also be a complex issue.

  8. Even models that have already been trained, such as GPT-3 and GPT-4, could face regulatory or legal consequences for their past data usage.

  9. Discussions around the "fair use" of scraped data continue. While some interpret "fair use" as allowing copyrighted materials to be used in AI training, this concept is contentious and subject to interpretation by the courts.

Web Scraping 101

Web scraping has long been a critical part of data collection for major search engines such as Google and Bing. These search engines use scripts, often written in programming languages like Python or Java, to collect massive amounts of data from the internet. This data is integral to the functioning of artificial intelligence (AI) applications, including generative models like GPT-4, whose usefulness relies heavily on extensive data. However, web scraping is not a simple process, nor is it without challenges. This guide will help you understand web scraping, its role in training AI models, and why it matters for stakeholders in global legal markets.

Why Does Web Scraping Matter?

Web scraping matters because it's a primary method for gathering data on a large scale, a necessity for building robust AI models. AI models learn from data, and the more data they have, the more accurate and nuanced their outputs can be. Whether we're talking about an AI model for predicting stock market trends or a language model for natural language processing tasks, web scraping provides the raw material for these models to learn and evolve.

Web scraping also matters from a business perspective. For instance, businesses can use it to monitor their competitors, analyze customer sentiment, observe market trends, gather data for machine learning models, and much more. It's a valuable tool for both businesses and researchers alike.

How AI Models and Companies Perform Web Scraping

Web scraping is performed using software tools known as 'crawlers' or 'spiders.' These tools systematically browse the internet, following links from page to page and collecting the specified data. However, this process isn't as straightforward as it seems. Websites are designed for human interaction, not machine readability. This complexity presents challenges when extracting and structuring data.
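To make "following links from page to page" concrete, here is a minimal sketch of a crawler in Python. It is an illustration only, not how any particular company does it. To keep it runnable without touching the internet, it crawls a small made-up site stored in a dictionary; the PAGES data, the LinkCollector class, and the crawl function are all hypothetical names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<p>About us</p> <a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post 1</a>',
    "https://example.com/blog/post-1": "<p>Hello, world</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages):
    """Breadth-first crawl: visit a page, queue its links, never revisit."""
    seen = {start_url}
    queue = deque([start_url])
    visited_order = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # unknown page; a real crawler would log and move on
            continue
        visited_order.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # turn "/about" into a full URL
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited_order


for page in crawl("https://example.com/", PAGES):
    print(page)
```

Starting from the home page, the sketch visits the About page, the Blog page, and then the blog post, skipping the link back home because it has already been seen. A real crawler would download each page over the network and would need to handle errors, politeness delays, and site rules, all of which are left out here.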

Enter AI-powered web scraping. Advanced AI algorithms help overcome the limitations of traditional scraping tools. They mimic human behavior to identify relevant data, extract it, and even clean and format it for later analysis. They can handle dynamic website content, bypass CAPTCHA systems, and avoid IP blocks, making the data collection process more efficient and robust.

Challenges in Web Scraping and Data Collection for AI Training

While web scraping has clear benefits, it also poses several challenges. These include:

1. Legal Challenges: Web scraping has faced legal challenges, with debates on copyright and privacy concerns. In the US, the question of fair use of copyrighted material in AI training is contentious. Social media platforms have tried to safeguard their data by restricting access and setting rate limits for viewing content.

2. Privacy Concerns: The use of personal data in AI models raises unique privacy concerns. Lack of transparency regarding how personal data is used and the potential harms from that use pose significant challenges.

3. Data Quality Issues: Web scraping can also lead to data quality issues, such as data errors, missing fields, duplicates, and diversity in online content publishing methods. These issues can affect the reliability and utility of the collected data.

4. Ethical Considerations: Ethical implications arise when web scraping is employed without due consideration of privacy norms, consent, and data ownership.
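The data quality issues in item 3 (missing fields and duplicates) are the kind of thing a short cleaning step tries to catch. Below is a rough sketch, assuming scraped records arrive as Python dictionaries; the clean_records function and the sample records are hypothetical and exist only for illustration.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, then drop duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        # Treat absent, None, or whitespace-only values as missing.
        values = [str(record.get(f) or "").strip() for f in required_fields]
        if not all(values):
            continue
        # Compare in lowercase so "Fair Use" and "fair use" count as duplicates.
        key = tuple(v.lower() for v in values)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(dict(zip(required_fields, values)))
    return cleaned


# Hypothetical scraped records with typical problems.
scraped = [
    {"title": "Fair Use Ruling", "url": "https://example.com/a"},
    {"title": "fair use ruling ", "url": "https://example.com/a"},  # duplicate
    {"title": "", "url": "https://example.com/b"},                  # empty title
    {"url": "https://example.com/c"},                               # no title at all
    {"title": "Privacy Suit", "url": "https://example.com/d"},
]

for row in clean_records(scraped, ["title", "url"]):
    print(row)
```

Of the five sample records, only two survive: the first "Fair Use Ruling" entry and "Privacy Suit". Real pipelines face much messier cases (near-duplicates, conflicting values, inconsistent page layouts), so treat this as a starting point rather than a complete solution.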

Click here and here for more.

🦄 🔋 Transformative Power of Generative AI

Today, I witnessed the transformative power of this technology.

I just wrapped up an awesome week in Hamburg, Germany, where I've been hanging out with and teaching some super cool lawyers and law students at Bucerius Law School. Many of them, who had tried ChatGPT or similar tools and remained indifferent, experienced a radical shift in their perspective. It was an ordinary day that rapidly evolved into an extraordinary dive into the expansive universe of generative AI.

When they began engaging with ChatGPT with a fresh, open mindset, their enthusiasm and curiosity sparked like a well-illuminated Christmas tree! The turning point was in realizing that the transformative power of AI shines brightest when it's used in a more profound, impactful manner. With conversational AI, it's not just about keying in rudimentary data or keywords. It's about igniting enriching conversations that bring invaluable insights and promote meaningful exchanges.

Over the course of this enlightening week, we navigated through intricate facets of legal service business models and business design. At times, the participants also explored the ethical dilemmas and technological origin story of generative AI with other professors.

Today, I had us focus on practical implementations. We rolled up our sleeves and experimented with AI in real-world scenarios, using existing case studies and innovating a few of our own. The experience was nothing short of mind-expanding!

What started as a routine exploration into the world of business design morphed into a journey of discovery and transformation. We were validating new business concepts and exploring the possibilities of developing new types of immigration law practices/services, predictive client services, not just foreign language translation but seamless collaboration between two parties who do not speak the same language, and more.

They were actively using and integrating generative AI into their processes, using it as an assistant, mentor, and muse. It was helping them unlock new ideas and new perspectives while expanding their thinking. No ethical dilemmas. No confidential information was shared. Simply highly effective utility.

This reminds me of what I am encountering in the world daily: people who have either given up on or disregarded generative AI. They believe it is a useless tool. The reality is, they are using it poorly or outright wrong!

Which brings me to . . . 👇

How to reintroduce Generative AI to someone who has dismissed it

Some tips:

1. Address their concerns first: Understand their fears, biases, and misconceptions. Once you understand their concerns, address them directly and honestly. Provide them with evidence-based information to counter their misconceptions.

2. Tell stories: People connect with narratives. Share success stories that highlight the value of generative AI in various fields. It could be stories about how generative AI has improved the life of a solo lawyer, the effectiveness of marketing a practice group, or even personal stories of individuals benefiting from AI-generated personalized content, like writing emails to friends or coworkers based on rough and incomplete notes.

3. Connect with their interests: Find examples of how generative AI intersects with their job or life situation. If they have no time to think about new approaches to a client dilemma, share how AI can be used for scenario planning. If they crave more credibility and authority in their field, show them how generative AI can help them design a complete class to teach – from learning objectives to syllabus, to course outline, to speaking notes. This will make the information more relatable and less intimidating.

4. Demystify the technology: Make sure to explain complex concepts in simple, accessible language. Use analogies or metaphors that can help them understand. For instance, you might explain how a generative AI model works similarly to how humans learn: by recognizing patterns and making predictions based on those patterns.

5. Show, don't just tell: If possible, give them a hands-on experience. Show them a generative AI tool in action or have them interact with one. This will give them a concrete sense of what it can do and how it can be used.

6. Stress ethical use and regulation: Talk about the mechanisms in place for ethical use of AI. This can help alleviate fears about AI misuse. Discuss the regulations and safeguards that exist and those that are being developed to ensure that AI is used ethically and responsibly.

7. Emphasize the ubiquity of AI: Many people don't realize how much AI they already use in their daily lives, like recommendation algorithms on shopping websites or voice recognition on their phones. Highlighting these familiar instances of AI can make the concept of generative AI less foreign and scary.

For instance, I know many lawyers who use voice notes on their mobile phones or have their email app turn voice notes into text – but few have asked where that data is stored, where it goes, or how it is used. Yet, we ask those questions of generative AI. Not saying they are bad questions! Just pointing out that familiarity breeds acceptance.

Back to today . . .

I have always cherished my experiences working with legal professionals and students in a learning setting. This week in Hamburg reaffirmed that sentiment. The enthusiasm for knowledge and the willingness to learn from each other created an invigorating environment. My gratitude to this remarkable cohort from around the globe for the shared experiences and lessons cannot be overstated.

This week has been a total blast. I always tap another source of energy and inspiration when I am working with lawyers and law students in a learning environment - and hey, I always make sure I'm learning right along with them. This group was something special. I've thanked them a bunch of times already, but I just gotta do it again here.

Check out this selfie I took just as our last class was wrapping up. Made some new friends this week, that's for sure! Thanks to Dan Katz and Dirk Hartung for having me here.

🚀🧩 One Billion Tokens!!

Remember, tokens are what these models work with and consume. While tokens aren't exactly words, for simplicity, think of them like words. This is because words get divided into tokens.
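To see how words get divided into tokens, here is a toy sketch. It is not the tokenizer any real model uses: the tiny VOCAB set is made up for this example, and real models learn vocabularies with tens of thousands of pieces. The idea it illustrates is the same, though: each word is split into the longest known pieces.

```python
# Hypothetical toy vocabulary, invented for this illustration only.
VOCAB = {"token", "s", "scrap", "ing", "web", "law", "yer"}


def tokenize_word(word, vocab=VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    word = word.lower()
    while i < len(word):
        # Try the longest possible piece first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: the character becomes its own token.
            pieces.append(word[i])
            i += 1
    return pieces


def tokenize(text, vocab=VOCAB):
    """Tokenize whitespace-separated text word by word."""
    return [piece for word in text.split() for piece in tokenize_word(word, vocab)]


tokens = tokenize("Web scraping lawyers tokens")
print(tokens)
print(len(tokens), "tokens from 4 words")
```

With this toy vocabulary, the four words become eight tokens: "scraping" splits into "scrap" and "ing", and "lawyers" into "law", "yer", and "s". That is why token counts usually run higher than word counts.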

We've all faced the issue of demanding too much from these models. For example, most of them can't handle much more than a few pages of text. There are models that can handle more, but managing 1 billion tokens is astounding.

So, why am I telling you this? It's to show you how rapidly we're addressing initial challenges. By essentially making the length of the context or prompt unlimited and not a problem, we're opening up limitless possibilities.

📈 🗣️ Generative AI: From the Lips of CEOs

Mentions of “generative AI” in earnings calls rose 129% in Q2 2023 compared to the previous quarter.

Click here for more.

News you can Use: 

Was this newsletter useful? Help me to improve!

With your feedback, I can improve the letter. Click on a link to vote:


DISCLAIMER: None of this is legal advice. This newsletter is strictly educational and is not legal advice or a solicitation to buy or sell any assets or to make any legal decisions. Please be careful and do your own research.