Tagalog-English generic corpus



We have assembled a Tagalog-English parallel corpus of 1,088,799 segments. The segments are drawn from public resources or privately curated sources, and around 15% are “synthetic” data generated by back-translation. The corpus is suitable for training a “baseline” Neural Machine Translation system and for other NLP tasks. A Transformer model trained on this dataset achieved a BLEU score of 48.86 on an unseen test set. A TMX version of the dataset is available on request.
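
For readers who request the TMX version, the following is a minimal sketch of how segment pairs might be read with Python's standard library. It assumes a standard TMX layout (`tu` / `tuv` / `seg` elements with an `xml:lang` attribute); the function name `read_tmx_pairs` and the language codes `tl` and `en` are illustrative assumptions, and the actual file may use different codes (for example `fil`).

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="tl", tgt_lang="en"):
    """Yield (source_text, target_text) pairs from a TMX file path or file object.

    The language codes are assumptions; adjust them to match the file.
    """
    # iterparse streams the file, so a large corpus is not loaded all at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "en-US" matches "en".
                texts[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            yield texts[src_lang], texts[tgt_lang]
        # Free the processed translation unit to keep memory use low.
        elem.clear()
```

Calling `list(read_tmx_pairs("corpus.tmx"))` on a hypothetical file name would return a list of (Tagalog, English) tuples; translation units missing either language are skipped.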

