One Million Hours Of YouTube Video Transcribed By OpenAI To Train GPT-4

From India Times: 2024-04-08 02:47:57

AI companies are struggling to collect enough high-quality training data, and some have turned to practices that raise legal questions under copyright law, as reported by The Wall Street Journal and The New York Times.

OpenAI transcribed over a million hours of YouTube videos to train its GPT-4 language model, believing the practice qualified as fair use despite legal concerns, as reported by The New York Times.

OpenAI curates unique datasets for each of its models to improve their understanding of the world and maintain its research competitiveness, and it is considering generating its own synthetic data, according to an email from OpenAI spokesperson Lindsay Held to The Verge.

Facing a shortage of valuable data in 2021, OpenAI considered transcribing YouTube videos, podcasts, and audiobooks, having already drawn on sources such as GitHub code and Quizlet homework content, as detailed by The New York Times.


