multilingual metrics machine-translation

What's a Good Machine Translation Metric or Gold Set?


I'm starting to look into machine translation of search queries, and I've been trying to think of ways to rate my translation system between iterations and against other systems. The first thing that comes to mind is collecting translations of a set of search terms from a number of people on mturk and treating each one as valid, or something along those lines, but that would be expensive and possibly prone to people submitting bad translations.

Now that I'm trying to think of something cheaper or better, I figured I'd ask StackOverflow for ideas, in case there's already some standard available, or someone has tried to find one of these before. Does anyone know, for example, how Google Translate rates various iterations of their system?


Solution

  • I'd suggest refining your question. There are a great many metrics for machine translation, and it depends on what you're trying to do. In your case, I believe the problem is simply stated as: "Given a set of queries in language L1, how can I measure the quality of the translations into L2, in a web search context?"

    This is basically cross-language information retrieval.

    What's important to realize here is that you don't actually care about providing the user with the translation of the query: you want to get them the results that they would have gotten from a good translation of the query.

    To that end, you can simply measure the discrepancy between the results list returned for a gold translation and the one returned for your system's translation. There are many metrics you can use for this, such as rank correlation or set overlap. The point is that you need not judge each and every translation; you only need to evaluate whether the automatic translation gives you the same results as a human translation.

    As for people proposing bad translations, you can check whether the candidate gold-standard translations produce similar results lists. For example, given 3 manual translations, do they agree in their results? If so, they are effectively synonyms from the IR perspective; if not, use the 2 that overlap most.
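
To make the two ideas above concrete, here is a minimal Python sketch. It assumes you already have a ranked list of document ids for each translation from your own search backend. The function names (`jaccard_at_k`, `pick_agreeing_gold`, `score_system`), the cutoff `k`, and the choice of Jaccard overlap are all illustrative assumptions, not a standard; a rank-aware measure could be swapped in for the overlap function.

```python
from itertools import combinations


def jaccard_at_k(results_a, results_b, k=10):
    """Set overlap (Jaccard) between the top-k ids of two ranked result lists."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty lists are treated as identical
    return len(top_a & top_b) / len(top_a | top_b)


def pick_agreeing_gold(gold_results, k=10):
    """Return the pair of human translations whose result lists overlap most.

    gold_results maps a (hypothetical) translator id to the ranked list of
    document ids retrieved for that translator's query. Needs at least two.
    """
    return max(
        combinations(sorted(gold_results), 2),
        key=lambda pair: jaccard_at_k(
            gold_results[pair[0]], gold_results[pair[1]], k
        ),
    )


def score_system(system_results, gold_results, k=10):
    """Average overlap between the system's results and the agreeing gold pair."""
    pair = pick_agreeing_gold(gold_results, k)
    overlaps = [jaccard_at_k(system_results, gold_results[name], k) for name in pair]
    return sum(overlaps) / len(overlaps)


# Made-up result lists for one query: three human translations and one system.
gold = {
    "t1": ["d1", "d2", "d3", "d4"],
    "t2": ["d2", "d1", "d3", "d5"],
    "t3": ["d9", "d8", "d7", "d1"],  # outlier, possibly a bad translation
}
system = ["d1", "d3", "d2", "d6"]

print(pick_agreeing_gold(gold, k=4))
print(score_system(system, gold, k=4))
```

Averaging this score over the whole query set gives a single number you can compare between iterations of your system, or against another system run on the same queries.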