Search code examples
pythonpandasdataframedata-cleaning

Different results with df.dropna(axis='rows') and df.drop(index=np.where(df.isnull().sum()!=0)[0], axis='index')


I am trying to remove all rows that contain NaN values from a dataframe. However, I have realized that using df.dropna(axis='rows'), but using df.drop(index=np.where(df.isnull().sum()!=0)[0], axis='index') gives different result. The former removes fewer rows than the later.

For example, my initial dataframe has 80 cols and 91713 rows.

  • If I use dropna() the resulting dataframe has 80 cols and 91639 rows (e.g. 74 rows were dropped).
  • If I instead use drop() the new shape is 80 cols and 56935 rows (e.g. 34778 were dropped).

Is there something wrong with how I am getting the indices to input to df.drop()? I do get 74 columns if I just look at the number of indices I am dropping with that method. E.g with df_nulls = df.iloc[np.where(df.isnull().sum()!=0)[0]], df_nulls.shape[0] is 74.

Update: I know there is definitely something wrong with the df.drop() method because when I try to run further processing on the data I get errors related to the there still being NaNs. But why would np.where(df.isnull().sum()!=0) not find all the NaN values?

Update 2: it is certainly just something wrong with my indexing (see below), but shouldn't iloc give the rows?

indices_rows_with_nulls = np.where(df.isnull().sum()!=0)[0] 
df_nulls = df.iloc[indices_rows_with_nulls] 
print('df.shape: '+ str(df.shape)+'   df_nulls.shape: '+ str(df_nulls.shape))
indices_rows_without_nulls = np.where(df.isnull().sum()==0)[0] 
df_no_nulls = df.iloc[indices_rows_without_nulls]
print('df.shape: '+ str(df.shape)+'   df_no_nulls.shape: '+ str(df_no_nulls.shape))

gives

df.shape: (91713, 80)   df_nulls.shape: (74, 80)
df.shape: (91713, 80)   df_no_nulls.shape: (6, 80)

Solution

  • You need to sum on columns

    df.isnull().sum(axis=1)!=0
    
    # or
    
    df.isnull().sum(axis='columns')!=0