How Tech Giants Cut Corners to Harvest Data for AI

From Yahoo: 2024-04-06 11:25:45

Facing a data shortage, OpenAI developed Whisper, a speech recognition tool, and used it to transcribe audio from YouTube videos into text for training GPT-4, despite copyright concerns. Google also transcribed YouTube videos to train its AI models, and Meta considered buying a publishing house to obtain training data. Tech companies could exhaust the supply of high-quality online data as soon as 2026.

The need for massive amounts of training data has led companies like OpenAI, Google, and Meta to push boundaries, cut corners, and potentially violate copyrights. They are pursuing new data sources by drawing on publicly available content and forming partnerships, while budgeting billions for AI integration. The use of copyrighted works to train AI models has sparked lawsuits and prompted the U.S. Copyright Office to examine how copyright law applies in the AI era.

As AI models grow in scale and performance, their data requirements have skyrocketed. Large language models like GPT-3 and Chinchilla are trained on huge datasets, which is what allows them to generate text with high accuracy. That appetite has set off a race among tech companies to secure enough quality data to keep advancing the technology.


