Search code examples
meaninformation-retrievalcontent-based-retrievalaverage-precision

Confusion about (Mean) Average Precision


In this question I asked clarifications about the precision-recall curve.

In particular, I asked if we have to consider a fixed number of rankings to draw the curve or we can reasonably choose ourselves. According to the answer, the second one is correct.

However now I have a big doubt about the Average Precision (AP) value: AP is used to estimate numerically how good is our algorithm given a certain query. Mean Average Precision (MAP) is average precision on multiple queries.

My doubt is: if AP changes according to how many objects we retrieve then we can tune this parameter to our advantage so we show the best AP value possible. For example, supposing that the p-r curve performs wonderfully until 10 elements and then horribly, we could "cheat" computing the (M)AP value considering only the first 10 elements.

I know that this could sound confusing, but I didn't find anything about this anywhere.


Solution

  • AP is the area under the precision-recall curve, and the precision-recall curve is supposed to be computed over the entire returned ranked list.

    It is not possible to cheat the AP by tweaking the size of the returned ranked list. AP is the area below the precision-recall curve which plots precision as a function of recall, where recall is the number of returned positives relative to the total number of positives that exist in the ground truth, not relative to the number of positives in the returned list. So if you crop the list, all you are doing is that you are cropping the precision-recall curve and ignoring to plot its tail. As AP is the area under the curve, cropping the list reduces the AP, so there is no wisdom in tweaking the ranked list size - the maximal AP is achieved if you return the entire list. You can see this for example from the code you cited in your other question - cropping the list simply corresponds to

    for ( ; i<ranked_list.size(); ++i) {
    

    changing to

    for ( ; i<some_number; ++i) {
    

    which results in fewer increments of ap (all increments are non-negative as old_precision and precision are non-negative and recall is non-decreasing) and thus smaller AP value.

    In practice, for purely computational reasons, you might want to crop the list at some reasonable number, e.g. 10k, as it is unlikely that AP will change much since precision@large_number is likely to be 0 unless you have an unusually large number of positives.

    Your confusion might be related to the way some popular function, such as VLFeat's vl_pr compute the precision-recall curves as they assume that you've provided them the entire ranked list and therefore compute the total number of positives in the ground truth by just looking at the ranked list instead of the ground truth itself. So if you used vl_pr naively on cropped lists you could indeed cheat it, but that would be an invalid computation. I agree it's not 100% clear from the description of the function, but if you examine the documentation in more detail, you'll see it mentions NUMNEGATIVES and NUMPOSITIVES, so that if you are giving an incomplete ranked list you should set these two quantities to let the function know how to compute the precision-recall curve / AP properly. Now if you plot different crops of a ranked list using vl_pr but with the same NUMNEGATIVES and NUMPOSITIVES for all function calls, you'll see that the precision-recall curves are just crops of each other, as I was explaining above (I haven't checked this yet as I don't have matlab here, but I'm certain it's the case and if it's not we should file a bug).