OpenAI, Anthropic Ignore Rule That Prevents Bots Scraping Web Content
From Business Insider: 2024-06-21 18:04:00
Two leading AI startups, OpenAI and Anthropic, are disregarding publishers’ requests to stop scraping their content for free model training data, in contravention of the robots.txt rule. TollBit exposed several AI companies’ non-compliance and alerted publishers in a letter. These AI firms, including OpenAI and Anthropic, are bypassing robots.txt to extract site content, despite public claims of adherence.
Although both companies say they respect robots.txt, OpenAI and Anthropic reportedly ignore blocks on their web crawlers, GPTBot and ClaudeBot, in contravention of website rules. OpenAI declined to comment on the allegations, pointing instead to a blog post stressing compliance with web crawler permissions. Anthropic did not respond to requests for comment.
Robots.txt, a plain-text file that websites have used since the late 1990s to tell automated crawlers which pages they may not access, is being flouted by AI companies in their pursuit of valuable training data for cutting-edge generative AI models. Drawn by the prospect of building more powerful models, their crawlers are retrieving content from websites in breach of established web policies.
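For readers unfamiliar with the mechanism, here is a minimal sketch of how a well-behaved crawler would check a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The file contents, the `example.com` URL and the `is_allowed` helper are hypothetical and written for illustration only; they do not come from any publisher or from TollBit's findings. The check is voluntary: nothing in the file technically stops a crawler that chooses to skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (an assumption for
# illustration): block the GPTBot and ClaudeBot crawlers, allow others.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string instead of
    downloading them, so the example runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"
    for agent in ("GPTBot", "ClaudeBot", "ExampleBrowser"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

In this sketch the two named crawlers are reported as blocked and any other agent as allowed. A crawler that never performs such a check, or ignores its result, can still download the page, which is why the rule depends on good-faith compliance.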
The popular chatbots ChatGPT and Claude rely on vast amounts of text data scraped from the web, some of which is copyrighted. Tech firms have argued for unrestricted access to web content for AI training data. OpenAI has secured content deals with publishers such as Axel Springer, while the US Copyright Office plans to revise AI and copyright regulations.
In short, OpenAI and Anthropic are bypassing established web rules to scrape content for training their AI models, despite public claims of compliance. TollBit discovered the practice and alerted publishers to the AI companies’ actions. The controversy undermines the integrity of robots.txt and the informal agreements supporting it.
Tech professionals with insights to share can contact Kali Hays via email or secure messaging.