Function using df.drop to remove multiple rows


I want to create a function that will let me input a list of values and will remove any rows that contain the values in a given column. I will use the following data frame as an example:

import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}
sample = pd.DataFrame(data)

I would like to remove any rows containing the following values in the 'Age' column.

remove_these = [20,21]

This is what I have so far:

def rem_out (df,column,x):
    df.drop(df[df['column'] == x].index, inplace = True)
    return df

In my function 'df' refers to the data frame, 'column' is the name of the column that should be checked for the values, and 'x' is the list of values. It is very important that I be able to give my function a list of values because I will be removing hundreds of values from my data.

When I run my function like this:

rem_out(sample, Age, remove_these)

I get an error saying that Age is not defined. How can I specify the column of interest so that the values in my list can be removed from the data frame?

Ideally, my function would have removed the first and second rows.


Solution

  • There are three issues, and two of them come from the distinction between a name and a string. In Python, a "bare" word is either a keyword such as def or else, or a name that refers to a function, a variable and so on. In your case:

    def rem_out (df,column,x):
        df.drop(df[df['column'] == x].index, inplace = True)
        return df
    

    Here, column (without quotes) is a name that refers to whatever is passed to the function. "column" in quotes, however, is the literal string "column". So whatever you pass to the function is ignored, and pandas instead looks for a column literally named "column" (raising a KeyError when there is none), which is not what you want. You need to remove the quotes there.

    rem_out(sample, Age, remove_these)
    

    Here, rem_out, sample and remove_these are "bare" and do refer to a function, a DataFrame and a list, respectively; all fine. But Age is also bare, so Python looks for something already named Age, finds nothing, and raises the "not defined" error you saw. What you need is the literal string "Age", which is then looked up as a column.

    Lastly,

    df[column] == x
    

    tests the column for equality against x, which is a list, and that is not what you want. You want to know whether each value in the column is in that list, not whether the column equals the list itself. So you need .isin there.

    Overall:

    def rem_out(df, column, to_remove):
        return df.drop(df[df[column].isin(to_remove)].index)
    
    new = rem_out(sample, "Age", remove_these)
    

    should do the trick. The inplace=True argument is also removed, as it is rarely if ever useful. With this change, .drop returns a new DataFrame, which rem_out returns in turn and which the caller assigns to new.
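
To check the corrected function end to end, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and runs it. The rem_out_mask variant is not part of the answer above; it is an assumed alternative that keeps the rows with a negated .isin mask instead of calling .drop.

```python
import pandas as pd

# Sample data from the question.
data = {"Name": ["Tom", "nick", "krish", "jack"],
        "Age": [20, 21, 19, 18]}
sample = pd.DataFrame(data)
remove_these = [20, 21]


def rem_out(df, column, to_remove):
    # Drop the rows whose value in `column` appears in `to_remove`.
    return df.drop(df[df[column].isin(to_remove)].index)


def rem_out_mask(df, column, to_remove):
    # Hypothetical alternative: keep only the rows NOT in `to_remove`.
    return df[~df[column].isin(to_remove)]


# .isin gives one True/False per row, which is what the filter needs.
mask = sample["Age"].isin(remove_these)
print(mask.tolist())  # [True, True, False, False]

new = rem_out(sample, "Age", remove_these)
print(new["Name"].tolist())  # ['krish', 'jack']

# Without inplace=True the original frame is left unchanged.
print(len(sample))  # 4
```

On this sample both functions return the same two rows. They can differ when the index has duplicate labels, because .drop removes every row that shares a dropped label, while the mask version filters row by row.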