Probing Audio-Generation Capabilities of Text-Based Language Models

December 1, 2025

Accepted to NAACL SRW 2025

Authors: Arjun Prasaath A, Ujjwal Kaur, Parteek Kumar, Souritra K.

This research investigates how textual representations of audio relate to what LLMs learn about the auditory world, and explores the extent to which LLMs can be prompted to generate audio despite being trained primarily on text. We employ a three-tier approach that progressively increases the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we use code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. We evaluate the quality and accuracy of the generated audio with Fréchet Audio Distance (FAD) and CLAP scores. Our findings reveal that LLMs can generate basic audio features, but their performance deteriorates as audio complexity increases. This suggests that LLMs possess a latent understanding of the auditory world, yet their ability to translate that understanding into tangible audio output remains rudimentary.
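
To make the code-as-intermediary idea concrete, the sketch below shows the kind of program an LLM might be prompted to write for the simplest tier, a single musical note. It is an illustration only, not the authors' prompts or pipeline: the choice of Python, the function names (`note_frequency`, `synthesize_tone`, `write_wav`), the 44.1 kHz sample rate, and the pure sine-wave timbre are all assumptions made here.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second (assumed default, not from the paper)


def note_frequency(midi_note: int) -> float:
    """Equal-temperament frequency in Hz for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def synthesize_tone(freq_hz, duration_s, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a pure sine tone as a list of signed 16-bit integer samples."""
    n_samples = int(sample_rate * duration_s)
    peak = amplitude * 32767  # scale to the signed 16-bit range
    return [
        int(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_samples)
    ]


def write_wav(target, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM samples to a file path or binary file-like object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


if __name__ == "__main__":
    # Hypothetical request: "play the note A4 for one second".
    write_wav("a4.wav", synthesize_tone(note_frequency(69), 1.0))
```

Executing the script writes a one-second A4 tone to `a4.wav`, which could then be scored against reference audio. Environmental sounds and speech have no comparably short closed-form recipe, which is consistent with the reported drop in quality at the higher tiers.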
