Google Research Team Builds Practical Machine Translation Systems for 1000+ Languages – Synced

Synced
AI Technology & Industry Review

56 Temperance St, #700
Toronto, ON M5H 3V5
In the new paper Building Machine Translation Systems for the Next Thousand Languages, a Google Research team proposes a practical machine translation (MT) system that can translate over one thousand languages, including both high-resource and low-resource languages.
Driven by the development of powerful machine learning models and the availability of large-scale web-mined datasets, the performance of academic and commercial machine translation (MT) systems has improved significantly in recent years. These systems, however, are generally restricted to fewer than 100 mainstream languages, a small fraction of the more than 7,000 languages currently spoken globally.
In his influential 2004 Wired article The Long Tail and subsequent book, Chris Anderson argues that the combined appeal of many niche products could eclipse that of the bestselling books and blockbuster movies that dominate the market. The ever-deepening libraries of today’s online booksellers and music and video streaming platforms seem to have confirmed this. Could we see a similar trend emerging in MT?
A Google Research team takes inspiration from the long-tail theory in their new paper Building Machine Translation Systems for the Next Thousand Languages, which proposes a practical MT system that can translate over 1,000 languages.
The team summarizes their study’s aims and contributions as:
The team notes that, despite large speaker populations, many languages spoken in Africa, South and South-East Asia, and the indigenous languages of the Americas remain relatively under-served by today’s MT systems, which tend to focus on European languages. Google Translate, for example, supports Maltese, Icelandic, and Corsican, each with fewer than 1M L1 (native) speakers, but not Bhojpuri (~51M speakers), Oromo (~24M speakers), or Quechua (~9M speakers).
To address this underrepresentation, the researchers first build monolingual web text corpora for such languages. They scale LangID (language identification) models to 1000+ languages by leveraging traditional n-gram models and semi-supervised learning approaches, then use these LangID models to identify and extract long-tail (aka low-resource) language data from the web.
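To make the n-gram idea concrete, here is a minimal sketch of character-trigram language identification with add-one smoothing. It is a toy illustration, not the team's LangID system: the function names, the two-language training set, and the smoothing choice are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language.

    `samples` maps a language code to a list of example sentences
    (hypothetical toy data, not a real corpus).
    """
    return {
        lang: Counter(g for sent in sents for g in char_ngrams(sent, n))
        for lang, sents in samples.items()
    }


def identify(text, models, n=3):
    """Pick the language whose smoothed n-gram model scores the text highest."""
    vocab_size = len(set().union(*models.values())) + 1
    best_lang, best_score = None, float("-inf")
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen n-grams from zeroing the score.
        score = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, assumed purely for illustration.
MODELS = train({
    "en": ["the cat sat on the mat", "this is the house of the dog",
           "where is the water"],
    "es": ["el gato está en la casa", "este es el perro de la casa",
           "dónde está el agua"],
})
```

Calling `identify("the dog and the cat", MODELS)` scores the text under each language model and returns the best-scoring code. A production system would need far more data per language and filtering of noisy web text, which this sketch does not attempt.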
With this mined monolingual data in hand, the team then builds general-domain MT models by exploiting the parallel data available for higher-resource languages. The team refers to this setup as zero-resource, since no direct supervision is available for the long-tail languages. To boost the quality of zero-resource translation for long-tail languages, the researchers leverage recently developed MT techniques such as self-supervised learning from monolingual data, massively multilingual supervised training, large-scale back-translation and self-training, and high-capacity models.
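As a rough illustration of back-translation in general, the sketch below turns monolingual target-language sentences into synthetic training pairs by running them through a reverse (target-to-source) translator. The `reverse_translate` callable and the word-by-word toy translator are hypothetical stand-ins; the article does not describe the team's actual pipeline at this level of detail.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each real target sentence with a machine-generated source sentence.

    The resulting (synthetic_source, real_target) pairs can be added to the
    training data of a source-to-target model.
    """
    pairs = []
    for target in target_sentences:
        synthetic_source = reverse_translate(target)
        if synthetic_source.strip():  # skip empty outputs
            pairs.append((synthetic_source, target))
    return pairs


# Hypothetical toy reverse model: a word-by-word dictionary lookup.
TOY_DICTIONARY = {"hola": "hello", "mundo": "world"}


def toy_reverse_translate(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_DICTIONARY.get(w, w) for w in sentence.split())
```

For example, `back_translate(["hola mundo"], toy_reverse_translate)` yields one pair whose source side is synthetic and whose target side is the original sentence. Keeping the human-written text on the target side is the point of the technique: the model learns to produce natural target-language output even when the source side is noisy.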
In their empirical study, the researchers used their models to translate English sentences into 38 long-tail languages to evaluate zero-resource translation capability, measuring performance with the character-level chrF metric (Popović, 2015) and human evaluations. They observed significant quality improvements, which they take as confirmation that the proposed approach can yield practical MT systems for long-tail languages.
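For readers unfamiliar with chrF, the following is a simplified sketch of a character n-gram F-score. It averages n-gram precision and recall over orders 1 to 6 and combines them with beta = 2; these settings, the whitespace handling, and the function names are assumptions for illustration and may differ from the exact configuration used in the paper or in standard evaluation toolkits.

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count character n-grams, ignoring whitespace (a simplifying assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF score in the range [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngram_counts(hypothesis, n)
        ref = char_ngram_counts(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matching n-grams
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

An identical hypothesis and reference score 1.0, strings with no characters in common score 0.0, and near-matches such as "the cat" against "the hat" fall in between. Because it works on characters rather than words, a metric of this kind gives partial credit for morphological variants, which is one reason it is often used for languages with rich morphology.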
The paper Building Machine Translation Systems for the Next Thousand Languages is on arXiv.
Author: Hecate He | Editor: Michael Sarazen