Search code examples
pythonpandasdataframerows

Pandas - Deleting rows with string a in column x and string b in column y


I'm looking for a method to delete rows that contain certain different strings in different columns at the same time.

I have a dataset like this:

    A       B      C       D
0  in    summer    x      NaN
1  in    summer    y      0.0
2  out   summer    g      3.2
3  out   winter    h      4.4
4  in    winter    e      0.0
5  in    summer    y      2.6

I want to get rid of all entries that in A contain out, in B contain winter, and get rid of all NaN and 0.0 values in D

To get rid of the NaNs I used:

df.dropna(subset = ['D'])

and this to clean out winter from B

df[~df.B.str.contains('winter')] 
df[~df.A.str.contains('out')] --> the above for winter will be undone

But when trying to use the upper command again to get rid of out in A, the first command acts like it never happened and the winter entries are all back, except that I sorted out the out in A. And how do I get rid of the 0.0 float values in D as well?

Desired output:

    A       B      C       D
5  in    summer    y      2.6

Sorry, I'm super inexperienced in Python but I need to do this for a project.


Solution

  • The problem with your code (df[~df.…)]) is that it creates a temporary output, but the original DataFrame remains unchanged.

    To get your expected result:

    • run drop passing indices of all rows which you can select by comparison with any fixed values,
    • then, to filter out NaN in D column, run dropna, as in your code, but this should be a chained statement,
    • save the result back under df.

    Example:

    df = df.drop(df.query("A == 'out' or B == 'winter' or D == 0").index).dropna(subset = ['D'])
    

    Then when you print df you will get:

        A       B  C    D
    5  in  summer  y  2.6
    

    Another option

    Run:

    df = df.drop(df.query("A == 'out' or B == 'winter' or D == 0 or D != D").index)
    

    This code relies on the fact that NaN is not equal to another NaN, so you can do it without explicit call to dropna.

    Yet another option

    Just query for the wanted rows and save the result back under df:

    df = df.query("A != 'out' and B != 'winter' and D != 0 and D == D")
    

    To see any "partial set" of rows to be dropped, you can run query with the respective partial query, e.g.:

    df.query("A == 'out'")
    

    will show rows with the "unwanted" value in A column.

    If you want to print rows with "unwanted" values in e.g. 2 columns, extend the query accordingly:

    df.query("A == 'out' or B == 'winter'")