python, metrics, information-retrieval, evaluation, precision-recall

Evaluating results from a search query in Python: ranked list vs. one manually labeled correct document


Given the following predicted ranked-list of documents:

query1_predicted = [1381, 1637, 646, 1623, 774, 1764, 92, 12, 642, 463, 613, ...]

and this manually marked best choice:

query1_manual = 646

Is there any suitable metric from information retrieval, already implemented in Python, that I can use to evaluate this result?

I do not think NDCG works for me because I am missing a true, fully ranked list of documents. I assume recall, precision, F-score and MAP also won't work as long as I don't have a full list of manually ranked results per query indicating each document's relevance.

By the way: the length of the predicted list equals the total number of documents in my collection:

len(query1_predicted) = len(documents)

Thanks for the help in advance!


Solution

  • An idea is to combine the precision and recall metrics. For example, if your query returns a list where the correct document is first, your precision and recall are both 100%. If it is in second place, your recall is still 100% (the single relevant document has been found), but your precision at that cutoff drops to 50%, and so on: with exactly one relevant document, precision at the rank where it first appears is simply 1/rank (see the sketch below). I know this approach is not perfect, but it gives a good insight into your results with well-known metrics.
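
A minimal sketch of this idea in Python (the helper name precision_recall_at_k is my own, and it assumes exactly one relevant document per query, as in the question):

def precision_recall_at_k(predicted, relevant_doc, k):
    """Precision and recall at cutoff k with a single relevant document."""
    hit = 1 if relevant_doc in predicted[:k] else 0
    precision = hit / k   # fraction of the top-k results that is relevant
    recall = hit / 1      # only one relevant document exists in total
    return precision, recall

# The correct document 646 sits at rank 3 of the predicted list,
# so precision@3 = 1/3 and recall@3 = 1.0:
query1_predicted = [1381, 1637, 646, 1623, 774, 1764, 92, 12, 642, 463, 613]
query1_manual = 646
print(precision_recall_at_k(query1_predicted, query1_manual, k=3))

Sweeping k from 1 to len(predicted) then traces out the full precision/recall trade-off for the query.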