YuE AI Technical Overview

YuE is an open-source model that generates complete songs from lyrics, a task known as lyrics2song. Many text-to-music models produce short instrumental clips well, but generating full-length songs (up to 5 minutes) with both vocals and instrumental accompaniment has remained a significant challenge. Four factors contribute: music requires long context, musical signals are more complex than speech or sound effects, singing distorts linguistic content, and parallel lyrics-audio data is scarce.

YuE addresses these challenges with four techniques. First, it uses a semantically enhanced audio tokenizer to reduce training costs and speed up convergence. Second, a dual-token technique models vocals and instrumentals in sync without altering the existing LLaMA decoder architecture, which keeps the model scalable and easy to deploy. Third, a "lyrics-chain-of-thought" approach lets the model generate an entire song progressively while following the lyrics. Finally, a three-stage training scheme targets scalability, musical coherence, and lyric alignment. Together, these techniques let YuE generate high-quality, full-length songs with expressive vocal melodies, coherent musical structure, and fitting instrumental accompaniment, all while following the provided lyrics.
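One way to picture the dual-token idea is as an interleaving of two frame-aligned token tracks into a single sequence, so that an unmodified decoder-only model still sees one ordinary token stream. The sketch below is only an illustration under that assumption: the function names, the strict vocal-then-instrumental ordering, and the integer token IDs are hypothetical and are not taken from the YuE codebase.

```python
from typing import List, Tuple


def interleave_tracks(vocal: List[int], instrumental: List[int]) -> List[int]:
    """Merge two frame-aligned token tracks into one sequence.

    Assumed layout (hypothetical): v0, i0, v1, i1, ... so each audio frame
    contributes one vocal token followed by one instrumental token.
    """
    if len(vocal) != len(instrumental):
        raise ValueError("tracks must cover the same number of frames")
    merged: List[int] = []
    for v, i in zip(vocal, instrumental):
        merged.extend((v, i))
    return merged


def split_tracks(merged: List[int]) -> Tuple[List[int], List[int]]:
    """Recover the two tracks from an interleaved sequence."""
    if len(merged) % 2 != 0:
        raise ValueError("interleaved sequence must have an even length")
    # Even positions hold vocal tokens, odd positions hold instrumental tokens.
    return merged[0::2], merged[1::2]


if __name__ == "__main__":
    vocal_tokens = [1, 2, 3]            # made-up token IDs for three frames
    instrumental_tokens = [10, 20, 30]  # made-up token IDs for the same frames
    stream = interleave_tracks(vocal_tokens, instrumental_tokens)
    print(stream)
    print(split_tracks(stream))
```

Because the merged stream is a flat list of token IDs, a standard next-token decoder could consume it without architectural changes, and the two tracks remain recoverable after generation for separate decoding back to audio.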