NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with LLMs

December 1, 2024

Accepted to SoLaR @ NeurIPS 2024

Authors: William Tan

We present NusaMT-7B, a 7-billion-parameter multilingual machine translation model designed specifically for Southeast Asian languages. Although the region is home to over 1,200 languages, existing translation systems provide limited support for most of them. NusaMT-7B covers 23 Southeast Asian languages, including low-resource languages such as Javanese, Sundanese, and Khmer. We introduce novel training techniques for handling low-resource language pairs and demonstrate state-of-the-art performance on the FLORES benchmark for the covered languages, with particular gains for underrepresented language pairs.
