OpenAI transcribed over a million hours of YouTube videos to train GPT-4
From Vox Media: 2024-04-06 16:29:51
AI companies are struggling to gather high-quality training data, according to The Wall Street Journal and The New York Times. OpenAI transcribed over a million hours of YouTube videos to train its GPT-4 model, raising legal questions. Google also gathered transcripts from YouTube, and Meta ran into limits on the availability of training data.
As companies contend with data scarcity for AI training, potential solutions such as training on synthetic data or using curriculum learning are under consideration. Meanwhile, using data without authorization has led to legal challenges, including copyright infringement lawsuits against OpenAI. The industry could run short of new content by 2028, according to The Journal’s reporting.