python-3.x pandas precision-recall

Calculate precision and recall based on values in two columns of a python pandas dataframe?


I have a dataframe in the following format:

Column 1 (Expected Output) | Column 2 (Actual Output)
[2,10,5,266,8]             |   [7,2,9,266]             
[4,89,34,453]              |   [4,22,34,453]

I would like to find the number of items in the actual output that were expected. For example, for row 1, only 2 and 266 are in both the expected and actual output, which means that precision = 2/4 (two of the four actual items were expected) and recall = 2/5 (two of the five expected items were found).

Since I have over 500 rows, I would like to find some sort of formula to find the precision and recall for each row.


Solution

  • Setting up your df like this:

    import pandas as pd

    df = pd.DataFrame({"Col1": [[2, 10, 5, 266, 8], [4, 89, 34, 453]],
                       "Col2": [[7, 2, 9, 266], [4, 22, 34, 453]]})
    

    You can find the matching values with:

    # Intersect the two lists row by row; zip does not depend on the index labels.
    df["matches"] = [set(a) & set(b) for a, b in zip(df["Col1"], df["Col2"])]
    

    from which you can calculate precision and recall.

    But be warned that this approach takes no account of the ordering of the elements in the expected and actual output lists, so it will fall down if ordering is important. It will also miscount if any value is duplicated in the "Expected Output" list, because converting a list to a set collapses duplicates.
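
As a minimal sketch of that last step, the per-row precision and recall could be derived from the matches roughly as below. It assumes the usual definitions (precision = matches / number of actual items, recall = matches / number of expected items) and counts distinct values only, to stay consistent with the set-based matching. The `precision` and `recall` column names are illustrative, not from the question.

```python
import pandas as pd

# Same example data as in the answer above.
df = pd.DataFrame({"Col1": [[2, 10, 5, 266, 8], [4, 89, 34, 453]],   # expected
                   "Col2": [[7, 2, 9, 266], [4, 22, 34, 453]]})      # actual

# Values present in both the expected and the actual list of each row.
df["matches"] = [set(e) & set(a) for e, a in zip(df["Col1"], df["Col2"])]
n_matches = df["matches"].map(len)

# Assumption: denominators count distinct values, mirroring the set intersection.
n_actual = df["Col2"].map(lambda values: len(set(values)))
n_expected = df["Col1"].map(lambda values: len(set(values)))

# Precision: share of the actual output that was expected.
df["precision"] = n_matches / n_actual
# Recall: share of the expected output that was actually produced.
df["recall"] = n_matches / n_expected

print(df[["precision", "recall"]])
```

For the first row this gives precision 0.5 and recall 0.4. A row with an empty list would produce NaN or inf in the corresponding column, so such rows may need separate handling.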