Accepted to SoLaR @ NeurIPS 2024
Authors: William Tan
We present NusaMT-7B, a 7-billion-parameter multilingual machine translation model designed for Southeast Asian languages. Although the region is home to over 1,200 languages, existing translation systems provide limited support for most of them. NusaMT-7B covers 23 Southeast Asian languages, including many low-resource languages such as Javanese, Sundanese, and Khmer. We introduce novel training techniques for handling low-resource language pairs and demonstrate state-of-the-art performance on the FLORES benchmark for the covered languages, with particular gains for underrepresented language pairs.

