I'm cleaning some data and I've been struggling with one thing.
I have a dataframe with 7740 rows and 68 columns.
Most of the columns contains Nan values.
What i'm interested in, is to remove NaN values when it is NaN in those two columns : [SERIAL_ID],[NUMBER_ID]
Example :
SERIAL_ID | NUMBER_ID |
---|---|
8RY68U4R | NaN |
8756ERT5 | 8759321 |
NaN | NaN |
NaN | 7896521 |
7EY68U4R | NaN |
95856ERT5 | 988888 |
NaN | NaN |
NaN | 4555555 |
Results
SERIAL_ID | NUMBER_ID |
---|---|
8RY68U4R | NaN |
8756ERT5 | 8759321 |
NaN | 7896521 |
7EY68U4R | NaN |
95856ERT5 | 988888 |
NaN | 4555555 |
Removing rows when NaN is in the two columns.
I've used the followings to do so :
df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all', inplace=True)
When I use this on my dataframe with 68 columns the result I get is this one :
SERIAL_ID | NUMBER_ID |
---|---|
NaN | NaN |
NaN | NaN |
NaN | NaN |
NaN | 7896521 |
NaN | NaN |
95856ERT5 | NaN |
NaN | NaN |
NaN | 4555555 |
I tried with a copy of the dataframe with only 3 columns, it is working fine.
It is somehow working (I can tel cause I have an identical ID in another column) but remove some of the value, and I have no idea why.
Please help I've been struggling the whole day with this. Thanks again.
I don't know why it only works for 3 columns and not for 68 originals. However, we can obtain desired output in other way.
use boolean indexing
:
df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]