I'm looking for a method to delete rows that contain certain different strings in different columns at the same time.
I have a dataset like this:
A B C D
0 in summer x NaN
1 in summer y 0.0
2 out summer g 3.2
3 out winter h 4.4
4 in winter e 0.0
5 in summer y 2.6
I want to get rid of all entries that in A contain out, in B contain winter, and get rid of all NaN and 0.0 values in D
To get rid of the NaNs I used:
df.dropna(subset = ['D'])
and this to clean out winter from B
df[~df.B.str.contains('winter')]
df[~df.A.str.contains('out')] --> the above for winter will be undone
But when trying to use the upper command again to get rid of out in A, the first command acts like it never happened and the winter entries are all back, except that I sorted out the out in A. And how do I get rid of the 0.0 float values in D as well?
Desired output:
A B C D
5 in summer y 2.6
Sorry, I'm super inexperienced in Python but I need to do this for a project.
The problem with your code (df[~df.…)]
) is that it creates a temporary output,
but the original DataFrame remains unchanged.
To get your expected result:
Example:
df = df.drop(df.query("A == 'out' or B == 'winter' or D == 0").index).dropna(subset = ['D'])
Then when you print df you will get:
A B C D
5 in summer y 2.6
Run:
df = df.drop(df.query("A == 'out' or B == 'winter' or D == 0 or D != D").index)
This code relies on the fact that NaN is not equal to another NaN, so you can do it without explicit call to dropna.
Just query for the wanted rows and save the result back under df:
df = df.query("A != 'out' and B != 'winter' and D != 0 and D == D")
To see any "partial set" of rows to be dropped, you can run query with the respective partial query, e.g.:
df.query("A == 'out'")
will show rows with the "unwanted" value in A column.
If you want to print rows with "unwanted" values in e.g. 2 columns, extend the query accordingly:
df.query("A == 'out' or B == 'winter'")