OpenAI's GPT-4 Reportedly Trained Using YouTube Videos: Details
OpenAI's unveiling of GPT-4, touted as its most robust large language model to date, marked a significant leap in AI capabilities. The model's proficiency was showcased through impressive scores on various professional and academic exams, indicating its advanced linguistic understanding. However, recent revelations shed light on OpenAI's unconventional training approach.
A report by The New York Times disclosed that OpenAI ran short of conventional training data while developing GPT-4. To address this shortfall, the company built its Whisper audio transcription model and used it to transcribe over a million hours of YouTube videos into text for training the GPT-4 language model, a legally ambiguous move.
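For illustration, the kind of audio-to-text pipeline described can be approximated with the open-source whisper Python package. This is a minimal sketch, not OpenAI's internal tooling; the input file name and model size below are hypothetical placeholders.

```python
# Minimal transcription sketch using the open-source whisper package
# (pip install openai-whisper). The file name is a hypothetical example,
# not a detail from the report.
import whisper

# Load a pretrained Whisper checkpoint; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Transcribe a local audio file into plain text suitable for a text corpus.
result = model.transcribe("downloaded_video_audio.mp3")
print(result["text"])
```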
OpenAI President Greg Brockman reportedly spearheaded the effort to source these videos, signalling the lengths to which the company went in dataset acquisition. With conventional data sources largely exhausted by 2021, internal discussions turned to transcribing YouTube videos, podcasts, and audiobooks for training purposes.
In response to inquiries, OpenAI spokesperson Lindsay Held emphasized the company's commitment to curating diverse datasets for each model to enhance its comprehension and competitiveness. Held also highlighted the use of public data and partnerships, and said the company is exploring the generation of synthetic data.
To recall, OpenAI's blog post introducing GPT-4 read, "We've created GPT-4, the latest milestone in OpenAI's effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks."
While speculation about the emergence of GPT-5 has circulated, OpenAI has not officially confirmed a launch timeline. Nonetheless, CEO Sam Altman has expressed ambitions for even more capable language models in the future, underscoring OpenAI's relentless pursuit of AI advancement.