We have assembled a Tagalog-English parallel corpus comprising 1,088,799 segments. The segments were drawn from public resources or privately curated, and around 15% are “synthetic” data generated by back-translation. The corpus is suitable for training a “baseline” Neural Machine Translation system and for other NLP tasks. A Transformer model trained on this dataset achieved a BLEU score of 48.86 on an unseen test set. A TMX version of the dataset is available on request.
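For readers who obtain the TMX version, the following is a minimal sketch of how segment pairs might be extracted with the Python standard library. It assumes the file follows the usual TMX layout (`tu` units containing `tuv` variants, each with a `seg`) and that the language codes are `tl` and `en`; the actual codes in the distributed file may differ (for example `fil` or `tgl`), and the function name `read_tmx_pairs` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang="tl", tgt_lang="en"):
    """Return (source, target) segment pairs from a TMX document string.

    The language codes are assumptions; adjust them to match the file.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "en-US" matches "en".
                key = lang.split("-")[0]
                segs[key] = "".join(seg.itertext()).strip()
        # Skip units that lack either side of the pair.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Hypothetical one-unit sample, not taken from the corpus itself.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="tl" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example"
          creationtoolversion="1" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="tl"><seg>Magandang umaga.</seg></tuv>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX)
```

For a file of roughly a million segments, loading the whole document into memory as above may be impractical; a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit.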