Fine-tuning: numbers game or fine art?

Nowadays you can’t visit a site on machine learning or deep learning without coming across a page or post about fine-tuning pre-trained models. What’s this all about? Well, in everyday usage, fine-tuning involves making small adjustments to a system or mechanism in order to improve its performance. Think of the guitarist tuning up his instrument before a concert. When machine translation developers  talk about fine-tuning, they are referring to the process of adjusting a machine translation model trained to translate general texts so that it can translate texts in a specific domain better than that generic model. The fine-tuning of computationally expensive pre-trained models is a key aspect of the Hugging Face philosophy.

This work is generally done by continuing the model training process using training data from the specialist field. For example, a dataset comprising parallel sentences in the field of cardiology might be used to empower a general translation model to translate cardiology texts with a reasonable degree of success. This process is also known as domain adaptation or custom machine translation. Its advantage is that we don’t need to train a machine translation system from scratch every time we want to translate documents in a new specialist field as the model parameters are taken over from the baseline model. Given the time and heavy hardware requirements involved in training a new neural machine translation system, this approach seems to represent an ideal solution. But does it always work?

In the past I have successfully used this technique to “specialise” my own Dutch-English NMT models to tackle the translation of particular sets of technical documents for industrial clients. I have recently become interested in the development of machine translation solutions for low-resource languages, particularly African languages.  The Opus-MT project provides models for a great variety of low-resource languages, as one of its stated aims is “to focus on the support of minority and low-resource languages”. The Opus-MT team at the University of Helsinki has provided over 1,000 pre-trained translation models that are free to download and use.

There is evidence that a model fine-tuned on an in-domain dataset can choose the correct translation of a technical term in that domain. The process of fine-tuning a generic English-French model to handle texts in the software domain is described in the Hugging Face course on Transformers. I followed the instructions in this course and fine-tuned the model “opus-mt-en-fr” with the “kde4” dataset, a multilingual collection of parallel texts drawn from the manual for KDE, arguably the second-most popular Linux desktop environment after GNOME. My test sentences in the IT domain were translated more accurately after fine-tuning.

For the experiments described in this post I picked out three African languages included in the Opus-MT project, namely Igbo, Twi and Luganda. Igbo is a member of the Volta-Niger branch of the Niger-Congo family of languages, and is spoken by some 29 million people, mainly in Nigeria. Luganda, or Ganda, is a member of the Bantu branch of Niger-Congo languages, spoken by about 3 million Baganda people, who live mainly in the Buganda region in southern Uganda. Twi is a variety of Akan, a member of the Kwa sub-group of Niger-Congo languages, spoken by about 7 million Twi people, mainly in Ghana. Ganda and Igbo are available in Google Translate and all three languages are available within Facebook Research’s No Language Left Behind (NLLB) project. In view of the paucity of training data available for them, these three languages are said to be “low-resource languages”.

The aim of my  project was to take the baseline Opus-MT models for these three languages and establish whether fine-tuning them on the basis of public datasets would lead to an improvement in the performance of these models. My test sets were taken from the Facebook Research FLORES-200 evaluation set. FLORES-200 consists of translations from 842 distinct web articles, totaling 3001 sentences.  These texts have been professionally translated and it is unlikely that they have been included in the training data used for the baseline Opus-MT models. My approach involved  translating the English version (“eng-Latn”) into my chosen African languages and using the corresponding “devtest texts” as my reference translations. This would give me three baseline BLEU scores. I’m aware of the limitations of evaluating MT output solely on the basis of BLEU but considered this technique adequate for the purposes of this exercise.
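The evaluation step described above boils down to comparing each system translation with its reference and counting overlapping n-grams. The block below is a minimal sketch of a corpus-level BLEU calculation, assuming whitespace tokenisation, a single reference per segment and no smoothing. The function names are my own for illustration, and the scores it produces will not match those of a standard tool such as sacreBLEU exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100) with one reference per segment.

    Assumes whitespace tokenisation and no smoothing, so the result is only
    an approximation of what a standard BLEU tool would report.
    """
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # n-grams produced by the system, summed over the corpus
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each n-gram at its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, one n-gram order with no matches zeroes the score

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Placeholder sentences, invented for illustration only.
system_output = ["the cat sat on the mat"]
reference = ["the cat sat on a mat"]
print(round(corpus_bleu(system_output, reference), 2))
```

In practice the hypotheses would be the model's translations of the English FLORES-200 sentences and the references the corresponding professional translations.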

I then fine-tuned these models with my chosen datasets. For Igbo there is the Ezeani English-Igbo dataset. The English-Luganda dataset was created by a team of researchers from the AI & Data Science Research Lab at Makerere University, together with a team of Luganda teachers, students and freelancers. The dataset for English and Akuapem Twi of 25,421 sentence pairs was built by NLPGhana and has been augmented with a further 26,118 sentences of unknown provenance.

Well, we know that fine-tuning can equip a model to produce a more accurate technical translation. My questions were: can fine-tuning make a poor model better, and does fine-tuning always improve a good model? I first wanted to examine the effects of such variables as the size of the dataset used for fine-tuning and the number of training epochs on a poorly performing “low-resource” model. As I stated above, I took the Opus-MT models for English-Igbo, English-Luganda and English-Twi as my baseline models. The BLEU scores achieved on the evaluation set by these models were 6.0, 2.1 and 9.6 respectively. To those used to seeing BLEU scores in the mid-sixties for high-resource language pairs, these numbers look quite dreadful, but they are not uncommon with low-resource language pairs. Could they be improved by fine-tuning with generic datasets that were not available to the Opus-MT developers? Let’s have a look at the numbers below.

Language pair          Baseline   After 1 epoch   After more epochs
ENGLISH-ROMANIAN *     41.60      32.60           30.90 (5 epochs)
ENGLISH-FRENCH **      47.70      44.40           43.10 (3 epochs)


Table:  BLEU scores for baseline Opus-MT models and after fine-tuning for different epochs

The biggest improvement is seen in the English-Igbo pair. The jump from 6.00 to 11.60 is achieved by fine-tuning for 20 epochs, after which the BLEU score decreases. The English-Luganda pair shows a slight increase in the BLEU score, from 2.10 to 3.20 after fine-tuning for 5 epochs, but this then decreases irregularly. The score achieved by the Opus-MT baseline is the highest for the English-Twi pair, and its score steadily goes down as the number of epochs increases. The size (20-30K) and the subject matter (news and Wikipedia material) of the datasets used for fine-tuning were broadly the same. The fact that it took English-Igbo 20 epochs to reach its top score, whereas English-Luganda reached its top score after 5 epochs, might suggest that the performance of the fine-tuning is determined by the strength of the baseline model. In the case of Opus-MT models (like Igbo and Twi) which have been trained on largely religious texts, it is probably more useful to train a new model from scratch and build it up using data augmentation and backtranslation than to fine-tune these baselines.

So far I’ve spoken about poorly performing baseline models involving low-resource African languages. I also examined whether fine-tuning reasonably well performing Opus-MT models for high-resource language pairs with generic datasets would significantly increase their BLEU scores. The baseline OPUS English-Romanian model obtained a respectable 41.60 on the FLORES-200 test set. I fine-tuned this model with the English-Romanian subset of the WMT16 dataset using the same basic script I used to train the African models. The BLEU score on the test set decreased from 41.60 to 32.60 after 1 epoch and to 30.90 after 5 epochs. This suggests that fine-tuning with a good-quality dataset in broadly the same domain as the baseline set is not enough to achieve an improvement in model performance. To investigate this aspect further I took the English-French dataset (20M+ sentence pairs) built by Chris Callison-Burch and used a 1 million segment subset of this resource to fine-tune the Opus-MT-en-fr model for one and three epochs. The resulting “fine-tuned” model produced a decrease in BLEU score of 3.30 and 4.60 respectively. With these two “good” models for high-resource languages, could the decrease in BLEU score be explained by overfitting?
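The arithmetic behind these decreases can be checked with a few lines of Python. This is only a sketch that restates the BLEU scores reported in this post; the dictionary layout and the function name are my own choices.

```python
# BLEU scores reported in this post: the baseline first, then each fine-tuned run.
reported_scores = {
    "English-Romanian": [41.60, 32.60, 30.90],  # baseline, 1 epoch, 5 epochs
    "English-French": [47.70, 44.40, 43.10],    # baseline, 1 epoch, 3 epochs
}


def bleu_changes(scores):
    """Return the change of each fine-tuned score relative to the baseline."""
    baseline = scores[0]
    return [round(score - baseline, 2) for score in scores[1:]]


for pair, scores in reported_scores.items():
    print(pair, bleu_changes(scores))
```

A negative value means that the fine-tuned model scored lower than the baseline it started from.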

My conclusion after these simple – and possibly naïve – experiments is that fine-tuning is not an automatic route to a better model. It generally gives the expected results on specialized texts within a chosen domain. It can increase the BLEU score, as in the case of the fine-tuned English-Igbo OPUS model but may result in a lower BLEU score as occurred with the English-Twi model. The baseline Opus-MT models for the high-resource languages – Romanian and French –   produced scores of respectively 41.60 and 47.70 on the test set, which dropped after fine-tuning.  There seems to be no hard and fast rule in this matter. There is no single fine-tuning script that will guarantee an increase in the BLEU score. There is no single dataset that will make a poor model better.  Fine-tuning is not just a numbers game, it’s a fine art.

Example of fine-tuning script derived from a Hugging Face tutorial

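import datasets
from datasets import load_dataset
from random import randrange
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the parallel corpus from a CSV file with an "lg" and an "en" column.
lugeng_dataset = load_dataset("csv", data_files="luganda-english.csv")

# Reshape each row into the {"id": ..., "translation": {"lg": ..., "en": ...}} layout.
lugeng_dataset = lugeng_dataset["train"].map(
    lambda ex, i: {"id": str(i), "translation": dict(ex)},
    remove_columns=["lg", "en"],
    features=datasets.Features(
        {"id": datasets.Value("string"), "translation": datasets.Translation(languages=["lg", "en"])}
    ),
    with_indices=True,
)

# Hold back 20% of the sentence pairs for evaluation.
lugeng_dataset = lugeng_dataset.train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("/home/tel34/nmtgateway/Helsinki-NLP/opus-mt-lg-en")

source_lang = "lg"
target_lang = "en"
prefix = "translate Luganda to English: "

def preprocess_function(examples):
    inputs = []
    targets = []
    for example in examples["translation"]:
        source, target = example[source_lang], example[target_lang]
        if (source is not None and target is not None
                and len(source.strip()) > 3 and len(target.strip()) > 3):
            inputs.append(prefix + source.strip())
            # Reconstructed line: the target side of a usable pair.
            targets.append(target.strip())
        else:
            # Empty or very short segment: report it and substitute a random
            # number so that inputs and targets stay aligned.
            print("There is an issue with this segment:")
            print("Source:", source)
            print("Target:", target)
            random_num = randrange(10000)
            print("Replaced with", random_num)
            inputs.append(prefix + str(random_num))
            # Reconstructed line: presumably the same number served as the target.
            targets.append(str(random_num))

    # text_target tokenizes the targets as labels; it replaces the deprecated
    # "with tokenizer.as_target_tokenizer():" block of the original script.
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

tokenized_lugeng = lugeng_dataset.map(preprocess_function, batched=True)

model = AutoModelForSeq2SeqLM.from_pretrained("/home/tel34/nmtgateway/Helsinki-NLP/opus-mt-lg-en")

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# The argument lists of the two calls below did not survive in this post.
# The values shown are placeholders in the style of the Hugging Face tutorial,
# not the settings used for the experiments described above.
training_args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-lg-en-finetuned",  # placeholder directory name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_lugeng["train"],
    eval_dataset=tokenized_lugeng["test"],
    data_collator=data_collator,
)

trainer.train()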

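The script above keeps its inputs and targets aligned by replacing unusable segments with random numbers. One possible alternative, sketched here as an assumption and not as the method used for these experiments, is to drop such pairs before tokenisation. The function name is_usable_pair and its parameters are hypothetical; the length check mirrors the one in preprocess_function.

```python
def is_usable_pair(example, source_lang="lg", target_lang="en", min_chars=4):
    """Return True when both sides of a pair are present and long enough.

    Mirrors the check in preprocess_function: more than three characters
    on each side after stripping whitespace.
    """
    pair = example["translation"]
    source, target = pair.get(source_lang), pair.get(target_lang)
    if source is None or target is None:
        return False
    return len(source.strip()) >= min_chars and len(target.strip()) >= min_chars


# Toy rows in the same layout as the dataset, with placeholder text only.
rows = [
    {"id": "0", "translation": {"lg": "source sentence one", "en": "target sentence one"}},
    {"id": "1", "translation": {"lg": "  ", "en": "target without a source"}},
    {"id": "2", "translation": {"lg": "source without a target", "en": None}},
]

usable = [row for row in rows if is_usable_pair(row)]
print([row["id"] for row in usable])
```

With the datasets library, a predicate of this kind could be passed to the dataset's filter method before the train/test split, so that no synthetic number pairs end up in the training data.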
Can we have confidence in neural machine translation?


What is neural machine translation?  Well, according to Wikipedia, “Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model”.  The stand-out phrase here is “predict the likelihood”. As a young translator, I was always under the impression that when I translated sentence A into sentence B I had to be certain that sentence B conveyed the meaning expressed in sentence A.  If I ever told my project manager that my translation was likely to convey the meaning of the original,  I would probably soon have found myself looking for a new job!

In the early days of machine translation, the translation was produced through the application of a very large number of rules. The greater the complexity of the languages involved, the more granular were the rules that governed the translation process. In theory, if all the necessary rules were applied and all the words in the source text were contained in a bilingual custom dictionary and in a bilingual general dictionary, you could be reasonably confident that the rule-based MT system would produce an appropriate, if somewhat wooden, translation.

Neural machine translation does not deploy a huge number of hand-crafted rules. Instead the rules enabling the model to translate from A to B, or to predict an output from an input sequence of tokens, are learned by the model from the data itself. Having sufficient “clean data” in the domains for which the neural MT system is used is half the battle when it comes to building confidence in securing an accurate translation. Of course, the model will be unable to generalise to sequences of tokens that resemble nothing it has seen during training. The inability of a model trained solely on the Bible and other religious texts to translate simple sentences like “My child is sick, I need to see a doctor” underscores the indispensability of data pertaining to the domains or fields of human experience to which we wish to apply the model. For developers working with “low resource” languages a lack of real-world data poses a challenge which is being met with a variety of innovative approaches.

Over the years since the appearance of rule-based MT in the early 1950s, various metrics have been developed to measure the accuracy of machine translation systems, the best-known being the Bilingual Evaluation Understudy (BLEU) algorithm, which is probably the main starting point for developers seeking to establish just how good their systems are.

When I was a schoolboy our knowledge of Latin and Greek was put to the test by having us translate “unseen” passages from the works of classical authors. Nobody could memorize the translations of every classical author so the test challenged us to generalise from our experience of the works of the authors on our syllabus and make a fair fist of rendering our text into English. We were expected to exercise creativity to deal with the odd unknown word in our text. A well-trained neural machine translation model that has not “over-fitted” or simply memorized the training data will produce a more or less successful translation of the unseen test set, as evidenced by whatever is commonly accepted as a good BLEU score.

Byte-pair encoding and other sub-word techniques will reduce the number of unknown words. Automatic evaluation metrics such as BLEU, NIST, METEOR, WER, PER, GTM, TER and CDER help researchers and developers to determine how successfully their model has been trained. Taking an NMT model into production in domains for which it has been trained is therefore definitely not a leap in the dark. Then again, professional translators sometimes make mistakes, and translation software makes different kinds of mistakes. A critical eye is always needed, however the translation is produced. To go back to our original question “Can we have confidence in neural machine translation?”, the answer is that with reservations we probably can.
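To make the remark about byte-pair encoding more concrete, here is a toy sketch of how sub-word merges are learned from a word list and then replayed on a word that never occurred in that list. It is a simplified illustration under my own assumptions (a tiny invented word list, an end-of-word marker written as </w>, hypothetical function names), not the tokeniser used by any of the models mentioned in these posts.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every occurrence of `pair` in a symbol sequence into one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to num_merges byte-pair merges from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the vocabulary.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word has been merged into a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the vocabulary with the most frequent pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word, seen or unseen, into sub-words by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy word list, invented for illustration.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=10)
print(segment("lowest", merges))
```

On this toy list the word “lowest”, which never occurs in the training words, is split into the known pieces “low” and “est</w>”, so the model is not faced with an unknown token.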

Why MyDutchPal?

You may be wondering “why MyDutchPal?” What has this business got to do with the Netherlands? Is anyone on our team Dutch?

Well, to find the answer we have to go back to 1992. Hook and Hatton, the company that owns this website, had obtained a contract to translate a huge volume of chemical specifications for the Dutch science and technology company DSM Research. We soon realised that the set of documents comprised grammatically simple sentences that featured recurring technical terms. Our founder Terence Lewis devised a series of rules to translate these sentences from Dutch into English. This series of rules eventually became “Trasy”, the first Dutch-English machine translation program. Siemens Nederland later acquired the rights to utilise this program for their Dutch-English translations and the software was deployed to translate much of the documentation for the HSL-Zuid project, the largest railway infrastructure project in the history of the Netherlands. Of course, MyDutchPal now uses advanced neural machine translation for its Dutch-English translations, which it offers as a turnkey project.

Why do you need a language technology audit?

Communication is the key to successful business relationships, especially nowadays when the reality of the working world and the globalized professional market has forced a metamorphosis and language skills can no longer be taken lightly. Post-covid, physical location (city, country, or even continent) is no longer a constraint. Your employees  are in contact with  foreign colleagues and customers daily, both virtually and in person. Therefore the company’s energy should be focused on the realization, implementation, and success of the project rather than on the efforts invested and time wasted to ensure a good understanding and overcome language barriers.  

Advanced AI-based language technology is a key tool for achieving these aims. Its implementation can take the form of remote interpreting, machine translation (in the cloud or “on premises”), enterprise chat translation or translation on a handheld device. To be effective, it is essential to diagnose the language strengths and weaknesses within the organization. A language technology audit allows an organization to identify those areas where language technology can assist employees in their communication with foreign colleagues and customers. These assessments look at all of a company’s activities which involve oral or written interactions with foreign customers, partners and colleagues and highlight those areas where employees could be assisted by the use of AI-based language technology. Requirements for language technology will vary from one business to the next. A customer support organisation may benefit from a system that directly translates communications in a chat environment. A company that needs to scan huge volumes of data in many languages every day will be looking for a machine translation system that can process millions of words in an hour. A scientific research organisation would be well served by a neural machine translation system designed to translate scientific documents in a specific branch of science to a “near human quality” standard. Requirements vary and there is an exciting range of language technology tools to meet these requirements. Our language technology audit is designed to identify such requirements within your organisation. You can purchase a voucher for a language technology audit in our “Knowledge Shop”.

Building machine translation models for  low resource East African languages

Downloading pretrained Hugging Face translation models, fine-tuning them with new datasets and converting them to the format of OpenNMT’s CTranslate2 inference engine seems to be the most cost- and energy-effective way to build new models for low resource language pairs where gathering data is a true treasure hunt. I’ve just fine-tuned the Opus-MT Oromo-English pair. Oromo is a Cushitic language spoken by about 30 million people in Ethiopia, Kenya, Somalia and Egypt, and is the third largest language in Africa. Despite the large number of speakers, there are very few bilingual written materials in Oromo and English. I managed to pull together some three thousand new sentences from human-translated documents and fine-tuned the Opus-MT pair in both directions. This fine-tuned model has been converted into the CTranslate2 format and is now available on my free translation site. The results still leave much to be desired, but the fine-tuned model could be useful at a very basic level. For the other language widely spoken in Ethiopia, Amharic (the official language, with some 25 million speakers), I managed to gather around one million sentence pairs from a variety of sources and trained models with the OpenNMT-tf framework. Again, at the level of simple sentences, like “The army delivers clean water to all the villages in the region”, the English-Amharic model generates useful if not perfect translations, and it makes a good job of a health-related sentence like “The government is introducing measures to stop the spread of the virus”. The Opus-MT Oromo<>English models were trained on the (limited) Opus data. As I found with my Tagalog<>English experiments last year, we seem to need around one million sentence pairs to get usable translations of simple sentences. The “zero-shot” road is one on which I have yet to travel!
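The conversion and inference steps mentioned above can be sketched as follows. This is a rough outline under stated assumptions, not the exact code behind my translation site: the function names, the model path and the output directory are placeholders, and the ctranslate2 and transformers packages must be installed before the functions are called.

```python
def convert_to_ctranslate2(model_name_or_path, output_dir):
    """Convert a (fine-tuned) Hugging Face translation model to the CTranslate2 format."""
    import ctranslate2  # imported here so the sketch loads without the library installed

    converter = ctranslate2.converters.TransformersConverter(model_name_or_path)
    converter.convert(output_dir, force=True)  # force=True overwrites an existing directory
    return output_dir


def translate(text, ct2_model_dir, tokenizer_name_or_path):
    """Translate one sentence with a converted model."""
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(ct2_model_dir)
    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name_or_path)

    # CTranslate2 works on token strings, so encode first and map the ids back to tokens.
    source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source_tokens])
    target_tokens = results[0].hypotheses[0]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens), skip_special_tokens=True)


# Hypothetical usage; both paths are placeholders for a locally fine-tuned model.
# convert_to_ctranslate2("path/to/fine-tuned-model", "path/to/ct2-model")
# print(translate("Hello", "path/to/ct2-model", "path/to/fine-tuned-model"))
```

Keeping conversion and translation in separate functions reflects the workflow described above: the model is converted once after fine-tuning and then loaded by the inference engine whenever a translation is requested.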