Fine-tuning: numbers game or fine art?

Nowadays you can’t visit a site on machine learning or deep learning without coming across a page or post about fine-tuning pre-trained models. What’s this all about? Well, in everyday usage, fine-tuning means making small adjustments to a system or mechanism in order to improve its performance. Think of a guitarist tuning up their instrument before a concert. When machine translation developers talk about fine-tuning, they are referring to the process of adjusting a model trained to translate general texts so that it translates texts in a specific domain better than the generic model does. The fine-tuning of computationally expensive pre-trained models is a key aspect of the Hugging Face philosophy.

This is generally done by continuing the training process using training data from the specialist field. For example, a dataset of parallel sentences in the field of cardiology might be used to enable a general translation model to translate cardiology texts with a reasonable degree of success. This process is also known as domain adaptation or custom machine translation. Its advantage is that we don’t need to train a machine translation system from scratch every time we want to translate documents in a new specialist field, because the model parameters are carried over from the baseline model. Given the time and heavy hardware requirements involved in training a new neural machine translation system, this approach seems to represent an ideal solution. But does it always work?

In the past I have successfully used this technique to “specialise” my own Dutch-English NMT models to tackle the translation of particular sets of technical documents for industrial clients. More recently I have become interested in the development of machine translation solutions for low-resource languages, particularly African languages. The Opus-MT project provides models for a great variety of low-resource languages, as one of its stated aims is “to focus on the support of minority and low-resource languages”. The Opus-MT team at the University of Helsinki has provided over 1,000 pre-trained translation models that are free to download and use.

There is evidence that a model fine-tuned on an in-domain dataset can choose the correct translation of a technical term in that domain. The process of fine-tuning a generic English-French model to handle texts in the software domain is described in the Hugging Face course on Transformers. I followed the instructions in this course and fine-tuned the model “opus-mt-en-fr” with the “kde4” dataset – a multilingual collection of parallel texts drawn from the manual for KDE, arguably the second-most popular Linux desktop environment after GNOME. My test sentences in the IT domain were translated more accurately after fine-tuning.

For the experiments described in this post I picked out three African languages included in the Opus-MT project, namely Igbo, Twi and Luganda. Igbo is a member of the Volta-Niger branch of the Niger-Congo family of languages and is spoken by some 29 million people, mainly in Nigeria. Luganda, or Ganda, is a member of the Bantu branch of Niger-Congo languages, spoken by about 3 million Baganda people, who live mainly in the Buganda region in southern Uganda. Twi is a variety of Akan, a member of the Kwa sub-group of Niger-Congo languages, spoken by about 7 million Twi people, mainly in Ghana. Ganda and Igbo are available in Google Translate and all three languages are available within Facebook Research’s No Language Left Behind (NLLB) project. Because so little training data is available for them, these three languages are classed as “low-resource languages”.

The aim of my project was to take the baseline Opus-MT models for these three languages and establish whether fine-tuning them on public datasets would improve their performance. My test sets were taken from the Facebook Research FLORES-200 evaluation set. FLORES-200 consists of translations from 842 distinct web articles, totaling 3001 sentences. These texts have been professionally translated and it is unlikely that they were included in the training data used for the baseline Opus-MT models. My approach involved translating the English version (“eng-Latn”) into my chosen African languages and using the corresponding “devtest texts” as my reference translations. This would give me three baseline BLEU scores. I’m aware of the limitations of evaluating MT output solely on the basis of BLEU but considered this technique adequate for the purposes of this exercise.
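To make the metric concrete, here is a minimal sketch of corpus-level BLEU in plain Python. It is not the scoring code used for the experiments in this post: it assumes whitespace tokenisation, one reference per sentence and no smoothing, whereas a standard tool such as sacreBLEU applies its own tokenisation, so absolute numbers will differ. The function names and the toy sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per sentence.

    Simplifying assumptions: whitespace tokenisation, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # "&" keeps the minimum count per n-gram, i.e. clipping.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a single zero precision makes the score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy English sentences, not taken from FLORES-200.
    system_output = ["the cat sat on the mat", "he reads a book every day"]
    reference = ["the cat sat on the rug today", "he reads a book every day"]
    print(f"BLEU = {corpus_bleu(system_output, reference):.2f}")
```

In the setup described above, the hypotheses would be the model’s translations of the English sentences and the references would be the professionally translated sentences in the target language.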

I then fine-tuned these models with my chosen datasets. For Igbo there is the Ezeani English-Igbo dataset. The English-Luganda dataset was created by a team of researchers from the AI & Data Science Research Lab at Makerere University together with a team of Luganda teachers, students and freelancers. The dataset for English and Akuapem Twi of 25,421 sentence pairs was built by NLPGhana and has been augmented with a further 26,118 sentences of unknown provenance.

Well, we know that fine-tuning can equip a model to produce a more accurate technical translation. My questions were: can fine-tuning make a poor model better, and does fine-tuning always improve a good model? I first wanted to examine the effects of such variables as the size of the dataset used for fine-tuning and the number of training epochs on a poorly performing “low-resource” model. As I stated above, I took the Opus-MT models for English-Igbo, English-Luganda and English-Twi as my baseline models. The BLEU scores achieved on the evaluation set by these models were 6.0, 2.1 and 9.6 respectively. To those used to seeing BLEU scores in the mid-sixties for high-resource language pairs, these numbers look quite dreadful, but they are not uncommon with low-resource language pairs. Could they be improved by fine-tuning with generic datasets that were not available to the Opus-MT developers? Let’s have a look at the numbers below.

Language pair         Baseline   After 1 epoch   After further epochs
ENGLISH-ROMANIAN *    41.60      32.60           30.90 (5 epochs)
ENGLISH-FRENCH **     47.70      44.40           43.10 (3 epochs)

Table: BLEU scores for baseline Opus-MT models and after fine-tuning for different epochs

The biggest improvement is seen in the English-Igbo pair. The jump from 6.00 to 11.60 is achieved by fine-tuning for 20 epochs, after which the BLEU score decreases. The English-Luganda pair shows a slight increase in the BLEU score, from 2.10 to 3.20 after fine-tuning for 5 epochs, but this then decreases irregularly. The score achieved by the Opus-MT baseline is the highest for the English-Twi pair, and its score goes down steadily as the number of epochs increases. The size (20-30K) and the subject matter (news and Wikipedia material) of the datasets used for fine-tuning were broadly the same. The fact that it took English-Igbo 20 epochs to reach its top score, whereas English-Luganda reached its top score after 5 epochs, might suggest that the outcome of fine-tuning is determined by the strength of the baseline model. In the case of Opus-MT models (like Igbo and Twi) which have been trained on largely religious texts, it is probably more useful to train a new model from scratch and build it up using data augmentation and backtranslation than to fine-tune these baselines.

So far I’ve spoken about poorly performing baseline models involving low-resource African languages. I also examined whether fine-tuning reasonably well-performing Opus-MT models for high-resource language pairs with generic datasets would significantly increase their BLEU scores. The baseline Opus-MT English-Romanian model obtained a respectable 41.60 on the FLORES-200 test set. I fine-tuned this model with the English-Romanian subset of the WMT16 dataset using the same basic script I used to train the African models. The BLEU score on the test set decreased from 41.60 to 32.60 after 1 epoch and to 30.90 after 5 epochs. This suggests that fine-tuning with a good-quality dataset in broadly the same domain as the baseline training set is not enough to achieve an improvement in model performance. To investigate this aspect further I took the English-French dataset (20M+ sentence pairs) built by Chris Callison-Burch and used a 1-million-segment subset of this resource to fine-tune the Opus-MT-en-fr model for one and three epochs. The resulting “fine-tuned” models showed a decrease in BLEU score of 3.30 and 4.60 points respectively. With these two “good” models for high-resource languages, could the decrease in BLEU score be explained by overfitting?
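One practical consequence of these results is that the baseline should be scored alongside every fine-tuned checkpoint and kept if nothing beats it. The short sketch below does this for the figures quoted in this post. The helper name best_checkpoint and the convention of using epoch 0 for the untouched baseline are assumptions made for illustration, and only the scores actually reported in the text are included.

```python
def best_checkpoint(bleu_by_epoch):
    """Return (epochs, bleu) for the highest-scoring checkpoint.

    Epoch 0 stands for the untouched baseline model, so the baseline
    competes on equal terms with every fine-tuned checkpoint.
    """
    return max(bleu_by_epoch.items(), key=lambda item: item[1])


# BLEU scores quoted in this post, keyed by number of fine-tuning epochs.
results = {
    "English-Igbo": {0: 6.00, 20: 11.60},
    "English-Luganda": {0: 2.10, 5: 3.20},
    "English-Romanian": {0: 41.60, 1: 32.60, 5: 30.90},
    "English-French": {0: 47.70, 1: 44.40, 3: 43.10},
}

for pair, scores in results.items():
    epochs, bleu = best_checkpoint(scores)
    change = bleu - scores[0]
    if epochs == 0:
        verdict = "keep the baseline"
    else:
        verdict = f"use the {epochs}-epoch checkpoint"
    print(f"{pair}: best BLEU {bleu:.2f} ({change:+.2f} vs baseline) -> {verdict}")
```

For the two low-resource pairs the fine-tuned checkpoint wins, while for Romanian and French the baseline remains the best model on this test set.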

My conclusion after these simple – and possibly naïve – experiments is that fine-tuning is not an automatic route to a better model. It generally gives the expected results on specialized texts within a chosen domain. It can increase the BLEU score, as in the case of the fine-tuned English-Igbo Opus-MT model, but it may also result in a lower BLEU score, as occurred with the English-Twi model. The baseline Opus-MT models for the high-resource languages, Romanian and French, produced scores of 41.60 and 47.70 respectively on the test set, which dropped after fine-tuning. There seems to be no hard and fast rule in this matter. There is no single fine-tuning script that will guarantee an increase in the BLEU score. There is no single dataset that will make a poor model better. Fine-tuning is not just a numbers game; it’s a fine art.

Example of fine-tuning script derived from a Hugging Face tutorial

from random import randrange

import datasets
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the parallel corpus from a CSV file with "lg" and "en" columns.
lugeng_dataset = load_dataset("csv", data_files="luganda-english.csv")

# Wrap each row as {"id": ..., "translation": {"lg": ..., "en": ...}}.
lugeng_dataset = lugeng_dataset["train"].map(
    lambda ex, i: {"id": str(i), "translation": dict(ex)},
    remove_columns=["lg", "en"],
    features=datasets.Features({
        "id": datasets.Value("string"),
        "translation": datasets.Translation(languages=["lg", "en"]),
    }),
    with_indices=True,
)

# Hold out 20% of the sentence pairs for evaluation.
lugeng_dataset = lugeng_dataset.train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("/home/tel34/nmtgateway/Helsinki-NLP/opus-mt-lg-en")

source_lang = "lg"
target_lang = "en"
prefix = "translate Luganda to English: "

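def preprocess_function(examples):
    inputs = []
    targets = []
    for example in examples["translation"]:
        if example[source_lang] is not None and example[target_lang] is not None and \
                len(example[source_lang].strip()) > 3 and len(example[target_lang].strip()) > 3:
            inputs.append(prefix + example[source_lang].strip())
            targets.append(example[target_lang].strip())
        else:
            # Empty or very short segment: report it and substitute a
            # random number so the batch keeps the same number of rows.
            print("There is an issue with this segment:")
            print("Source:", example[source_lang])
            print("Target:", example[target_lang])
            random_num = randrange(10000)
            print("Replaced with", random_num)
            inputs.append(prefix + str(random_num))
            # Assumed: the target side gets the same placeholder so that
            # inputs and targets stay aligned (line missing in the listing).
            targets.append(str(random_num))

    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    # as_target_tokenizer() is deprecated in recent transformers releases
    # in favour of tokenizer(inputs, text_target=targets, ...).
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
    return model_inputs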

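# Tokenize both splits in batches.
tokenized_lugeng = lugeng_dataset.map(preprocess_function, batched=True)

model = AutoModelForSeq2SeqLM.from_pretrained("/home/tel34/nmtgateway/Helsinki-NLP/opus-mt-lg-en")

# Pads inputs and labels dynamically to the longest item in each batch.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)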

training_args = Seq2SeqTrainingArguments(











trainer = Seq2SeqTrainer(