[arXiv] Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation

Prathyusha Jwalapuram, Barbara Rychalska, Shafiq Joty, Dominika Basaj

Despite a growing number of machine translation systems that incorporate contextual information, evidence of improved translation quality is sparse, especially for discourse phenomena. This can be partly attributed to the shortcomings of MT evaluation methods: popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks and evaluate several baseline MT systems on the datasets. We find that context-aware models do not improve discourse-related translations consistently across languages and phenomena. We aim to maintain a leaderboard for future MT systems to demonstrate their competence at translating discourse phenomena.




