Search code examples
pandaspandas-loc

Pandas - row selection with column operation & .loc[]


I have been practicing with pandas and have the following problem. I have been practicing with an Olympic medal data set where the countries are listed in one column, and the medals in subsequent columns.

As part of a QC check, I wanted to write a line of code that checks that the sum of the gold (01 !), silver (02 !) and bronze (03 !) medals is equal to the value in the 'total' column. I have tried this a number of ways, including .apply with the function below, and also with .loc[] (see code below). However, I keep on getting errors back. Can someone explain where I am going wrong?

To expand, the idea of the code would be to select all rows where the sum is correct, and exclude those which are not. I have shown two of the methods I have tried below.

I am new to python and coding in general so apologies if my question is not clear enough.

Many thanks

Luke

df = pd.DataFrame({Country: ['Afghanistan', 'Algeria', 'Argentina'],        
01 !: [0, 0, 1],
02 !: [5, 2, 8],
03 !: [18, 24, 28],
Total: [23, 26, 38]})

def medal_sum_check(gold, silver, bronze, total):
    if gold + silver + bronze == total:
        return True
    else:
        return False

df2 = df.loc[df.apply(medal_sum_check(df.iloc[1], df.iloc[2], df.iloc[3], df.iloc[4]))]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

or

df2 = df.loc[(df.iloc[1] + df.iloc[2] + df.iloc[3] == df.iloc[4])]

ValueError: cannot reindex from a duplicate axis

Solution

  • You can directly add do the conditional check on the entire column in pandas. Here is a demonstration:

    df = pd.DataFrame({"Country": ['Afghanistan', 'Algeria', 'Argentina'],        
    "01 !": [0, 0, 1],
    "02 !": [5, 2, 8],
    "03 !": [18, 24, 28],
    "Total": [23, 26, 38]})
    
    print(df["01 !"] + df["02 !"] + df["03 !"] == df["Total"])
    

    Output:

    0     True
    1     True
    2    False
    

    You can also add the check output as a new column in your data frame by doing this:

    df["check"] = df["01 !"] + df["02 !"] + df["03 !"] == df["Total"]
    

    Then your DataFrame will look like this:

           Country  01 !  02 !  03 !  Total  check
    0  Afghanistan     0     5    18     23   True
    1      Algeria     0     2    24     26   True
    2    Argentina     1     8    28     38  False