Tags: nlp, precision-recall, machine-translation

Best evaluation method for real-time machine translation?


I'm aware that there are many different methods like BLEU, NIST, METEOR etc. They all have their pros and cons, and their effectiveness differs from corpus to corpus. I'm interested in real-time translation, so that two people could have a conversation by typing out a couple of sentences at a time and having them immediately translated.

What kind of corpus would this count as? Would the text be considered too short for proper evaluation by most conventional methods? Would the fact that the speakers are constantly switching back and forth make the context harder to evaluate?


Solution

  • What you are asking for belongs to the domain of Confidence Estimation, nowadays (within the Machine Translation (MT) community) better known as Quality Estimation, i.e. "assigning a score to MT output without access to a reference translation".

    For MT evaluation (using BLEU, NIST or METEOR) you need:

    1. A hypothesis translation (MT output)
    2. A reference translation (from a test set)

    In your case (real-time translation), you do not have (2). So you will have to estimate the performance of your system, based on features of your source sentence and your hypothesis translation, and on the knowledge you have about the MT process.
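To make the reference requirement concrete, here is a minimal, self-contained sketch of sentence-level BLEU (single reference, add-one smoothing). It is an illustration of the idea only; real toolkits such as NLTK or sacreBLEU handle multiple references and several smoothing variants. Note that the score simply cannot be computed without the `reference` argument, which is exactly what is missing in the real-time setting.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Minimal sentence-level BLEU sketch: geometric mean of smoothed
    n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches: intersection of the two multisets.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing so short sentences don't zero out the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty discourages overly short hypotheses.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu("the cat sits on the mat",   # hypothesis (MT output)
             "the cat sat on the mat")    # reference (from a test set)
```

A perfect match scores 1.0, and scores fall as the n-gram overlap with the reference shrinks; in a live conversation there is no reference to overlap with.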

    A baseline system with 17 features is described in:

    • Specia, L., Turchi, M., Cancedda, N., Dymetman, M., & Cristianini, N. (2009). Estimating the sentence-level quality of machine translation systems. In Proceedings of the 13th Conference of the European Association for Machine Translation (pp. 28–37)
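To give a feel for what such features look like, here is a hedged sketch of a few sentence-level features computed from the source sentence and the MT hypothesis alone, with no reference translation. These particular features are my own illustrative stand-ins in the spirit of the baseline set, not the exact 17 features from the paper (which also include e.g. language-model probabilities and translation-table statistics).

```python
def qe_features(source, hypothesis):
    """Illustrative quality-estimation features: everything here is
    computable without a reference translation."""
    src, hyp = source.split(), hypothesis.split()
    return {
        "src_len": len(src),                            # source length in tokens
        "hyp_len": len(hyp),                            # hypothesis length in tokens
        "len_ratio": len(hyp) / max(len(src), 1),       # length ratio
        "src_avg_tok_len": sum(map(len, src)) / max(len(src), 1),
        "hyp_type_token_ratio": len(set(hyp)) / max(len(hyp), 1),
        # Mismatched punctuation counts often signal a degraded translation.
        "punct_diff": abs(sum(t in ",.;:!?" for t in src)
                          - sum(t in ",.;:!?" for t in hyp)),
    }

feats = qe_features("Das ist ein kurzer Satz .",
                    "This is a short sentence .")
```

Feature vectors like this are then fed to a regressor trained on human quality judgments, which is what produces the quality score at run time.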

    Quality Estimation is an active research topic. The most recent advances can be followed on the websites of the WMT conferences: look for the Quality Estimation shared tasks, for example http://www.statmt.org/wmt17/quality-estimation-task.html