Downloading pretrained Hugging Face translation models, fine-tuning them with new datasets and conversion to OpenNMT’s CTranslate2 inference engine – that seems to be the most cost- and energy-effective way to build new models for low resource language pairs where gathering data is a true treasure hunt. I’ve just fine-trained the Opus-MT Oromo-English pair. Oromo is a Cushitic language spoken by about 30 million people in Ethiopia, Kenya, Somalia and Egypt, and is the third largest language in Africa. Despite the large number of speakers, there are very few bilingual written materials in Oromo and English. I managed to pull together some three thousand new sentences from human-translated documents and fine-tuned the Opus-MT pair in both directions. This fine-tuned model has been converted into the CTranslate2 format and is now available on my free translation site at http://nmtgateway.com. The results still leave much to be desired, but the fine-tuned model could be useful at a very basic level. For the other language widely spoken in Ethiopia – Amharic, the official language with some 25 million speakers -, I managed to gather around one million sentence pairs from a variety of sources and trained models with the OpenNMT-tf framework. Again, at the level of simple sentences, like “The army delivers clean water to all the villages in the region”, the English-Amharic model generates useful if not perfect translations, and it makes a good job of a health-related sentence like “The government is introducing measures to stop the spread of the virus”. The Opus-MT Oromo<>English models were trained on the (limited) Opus data. As I found with my Tagalog<>English experiments last year, we seem to need around one million sentence pairs to get usable translations of simple sentences. The “zero-shot” road is one on which I have yet to travel!
- Posted on
- By Terence