I am trying to remove all rows that contain NaN values from a dataframe. However, I have noticed that df.dropna(axis='rows') and df.drop(index=np.where(df.isnull().sum()!=0)[0], axis='index') give different results: the former removes fewer rows than the latter.
For example, my initial dataframe has 80 cols and 91713 rows. With dropna() the resulting dataframe has 80 cols and 91639 rows (i.e. 74 rows were dropped). With drop() the new shape is 80 cols and 56935 rows (i.e. 34778 rows were dropped). Is there something wrong with how I am getting the indices to pass to df.drop()? I do get 74 if I just count the indices I am dropping with that method: with df_nulls = df.iloc[np.where(df.isnull().sum()!=0)[0]], df_nulls.shape[0] is 74.
Update:
I know there is definitely something wrong with the df.drop() method, because when I try to run further processing on the data I get errors related to there still being NaNs. But why would np.where(df.isnull().sum()!=0) not find all the NaN values?
Update 2: it is certainly just something wrong with my indexing (see below), but shouldn't iloc select rows?
indices_rows_with_nulls = np.where(df.isnull().sum()!=0)[0]
df_nulls = df.iloc[indices_rows_with_nulls]
print('df.shape: '+ str(df.shape)+' df_nulls.shape: '+ str(df_nulls.shape))
indices_rows_without_nulls = np.where(df.isnull().sum()==0)[0]
df_no_nulls = df.iloc[indices_rows_without_nulls]
print('df.shape: '+ str(df.shape)+' df_no_nulls.shape: '+ str(df_no_nulls.shape))
gives
df.shape: (91713, 80) df_nulls.shape: (74, 80)
df.shape: (91713, 80) df_no_nulls.shape: (6, 80)
You need to sum across the columns, i.e. per row. By default, sum() reduces along axis=0, so df.isnull().sum() gives one NaN count per column, and np.where then returns column positions, not row positions. That also explains Update 2: 74 of your 80 columns contain at least one NaN, so you get 74 positions, and iloc then selects the 74 rows at those positions (and 6 rows for the 6 clean columns). Instead, use
df.isnull().sum(axis=1)!=0
# or
df.isnull().sum(axis='columns')!=0
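A minimal sketch on a toy dataframe (hypothetical column names a/b/c, not from your data) showing the difference between the two axes, and that dropping the per-row positions matches dropna():

import numpy as np
import pandas as pd

# Tiny frame: 4 rows, 3 columns, NaNs in rows 1 and 2.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Default sum() reduces over rows (axis=0): one count PER COLUMN.
per_column = df.isnull().sum()        # a -> 1, b -> 1, c -> 0
# sum(axis=1) reduces over columns: one count PER ROW.
per_row = df.isnull().sum(axis=1)     # 0, 1, 1, 0

# Positions of rows that actually contain a NaN:
rows_with_nulls = np.where(per_row != 0)[0]   # array([1, 2])

# Dropping those rows now matches dropna():
clean = df.drop(index=df.index[rows_with_nulls])
assert clean.equals(df.dropna())

Note that drop() takes index labels, not positions, which is why df.index[rows_with_nulls] is used here; on a default RangeIndex the two happen to coincide.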