I have a dataset on which I am developing a search engine, and that part is done. My next step is to measure the performance of this search engine. I believe Mean Average Precision (MAP) and Recall (R) are the two metrics I need to calculate. I know the formulas for these metrics, and I could compute them if I simply labelled my documents as relevant or irrelevant with respect to each query. But my labels for each document are not binary; instead, the documents are graded for each query.
For example, for query 1 my dataset says document 14 is the most relevant (score: 5), document 54 is fairly relevant (score: 4), document 33 is less relevant (score: 3), and so on, down to score 0, which means the document is completely irrelevant and off-topic.
My question is: how can I calculate MAP/R with this kind of labelling? In other words, how can I evaluate the system if my second-most-relevant document is ranked first, or my most relevant document appears at rank 10, and so on?
Please understand that I must use MAP/R to evaluate my search engine.
I hope to get some direction on this. Cheers!
Mean Average Precision is designed to evaluate an information retrieval system with a binary relevance function. You, on the other hand, have a graded relevance function. Therefore, you need a different method to evaluate your system.
While there have been attempts to generalize the Average Precision measure to handle graded relevance, the right thing to do is to evaluate your system with the Normalized Discounted Cumulative Gain (NDCG) measure.
NDCG is designed for situations with graded notions of relevance. Like precision at k, it is evaluated over some number k of top search results. In essence, NDCG measures the gain contributed by each document based on its graded relevance and its position in the result list: the gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks. The formula is sketched below.
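For reference, one common formulation (there are variants) is

$$\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k},$$

where $rel_i$ is the graded relevance of the result at rank $i$ and $\mathrm{IDCG}_k$ is the DCG of the ideal ordering (the documents sorted by grade, best first). Here is a minimal Python sketch of that formulation; the `gains` list is hypothetical, with `gains[i]` standing for the grade your dataset assigns to the document your engine ranked at position `i + 1`:

```python
import math

def dcg_at_k(gains, k):
    """Discounted Cumulative Gain: graded gains discounted by log2 of the rank."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """Normalize DCG by the ideal DCG (the same gains sorted best-first)."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: the result at rank 1 has grade 4, rank 2 has
# grade 5 (the best document was ranked too low), and so on.
gains = [4, 5, 0, 3, 0, 1]
print(f"NDCG@5 = {ndcg_at_k(gains, k=5):.3f}")
```

Averaging `ndcg_at_k` over all your queries gives a single system-level score, exactly the way MAP averages Average Precision over queries.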
If for some reason you must use MAP or Recall to evaluate your system, then you'll have to make your relevance labels binary by deciding on a threshold above which documents are considered relevant. However, any such threshold discards part of the information encapsulated in your graded relevance measure; the only way to use all of it is an evaluation method that supports graded relevance, such as NDCG.
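If you do go the binary route, here is a minimal sketch, assuming a hypothetical cutoff of grade >= 3 (the threshold is your judgment call) and one list of graded labels per query, in ranked order:

```python
def binarize(gains, threshold=3):
    """Collapse graded labels to binary: 1 if the grade meets the threshold, else 0.
    The threshold of 3 is an assumption -- pick the cutoff that fits your scale."""
    return [1 if g >= threshold else 0 for g in gains]

def average_precision(binary, n_relevant=None):
    """Average Precision: mean of precision@k over the ranks k that hold a
    relevant document. n_relevant is the total number of relevant documents
    for the query; the default assumes they all appear in the ranked list."""
    if n_relevant is None:
        n_relevant = sum(binary)
    hits, total = 0, 0.0
    for i, rel in enumerate(binary):
        if rel:
            hits += 1
            total += hits / (i + 1)  # precision at rank i + 1
    return total / n_relevant if n_relevant else 0.0

def recall_at_k(binary, k, n_relevant=None):
    """Recall@k: fraction of all relevant documents retrieved in the top k."""
    if n_relevant is None:
        n_relevant = sum(binary)
    return sum(binary[:k]) / n_relevant if n_relevant else 0.0

# Hypothetical data: graded labels per query, in the order your engine ranked them.
per_query_gains = [[5, 3, 0, 4, 0], [0, 4, 5, 1]]
binary_lists = [binarize(g) for g in per_query_gains]
map_score = sum(average_precision(b) for b in binary_lists) / len(binary_lists)
print(f"MAP = {map_score:.3f}")
print(f"Recall@3 (query 1) = {recall_at_k(binary_lists[0], 3):.3f}")
```

Just keep in mind that everything above the threshold collapses to "equally relevant", which is exactly the information loss NDCG avoids.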