python pandas dataframe nan missing-data

How to delete the DataFrame rows with the largest number of NaNs?


Pandas and other questions/answers on this site provide solutions for the case where we know how many non-NaN values to preserve. How can I efficiently delete just the worst row, or all of the worst rows if several tie for the most NaNs? The examples below show how to remove columns; the same calls work on rows by changing the axis. However, they require specifying how many non-NaN values to keep.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
>>> df
     A    B  C    D
0  1.0  NaN  1  NaN
1  1.0  1.0  1  1.0
2  1.0  NaN  1  1.0
3  NaN  1.0  1  1.0

>>> df.dropna(thresh=3, axis=1)
     A  C    D
0  1.0  1  NaN
1  1.0  1  1.0
2  1.0  1  1.0
3  NaN  1  1.0

Or, to drop every column that contains at least one NaN:

>>> df.dropna(axis=1)
   C
0  1
1  1
2  1
3  1

Note that I give more context below. While a hint toward a specific solution for that context is welcome, I would prefer an answer to the general case stated in the title of the post.

Context: I am looking for an efficient way to remove the row with the largest number of NaNs (or all such rows if several tie for the largest number), and then remove the column(s) analogously, so that I can repeat these two steps until all NaNs are removed. The goal is to remove the NaNs while preserving as much data as possible and keeping the table consistent, i.e., only entire rows or columns may be removed. Please read the note above.

The examples above are taken from this answer: https://stackoverflow.com/a/68306367/9681577


Solution

  • You can use boolean indexing with the count of NaNs:

    # count the number of NaNs per row
    s = df.isna().sum(axis=1)
    
    # drop rows which have the max number, if > 0
    out = df[~(s.eq(s.max()) & s.gt(0))]
    

    Equivalently, by De Morgan's law (NaN counts are never negative, so "not greater than 0" is the same as "equal to 0"):

    out = df[s.ne(s.max()) | s.eq(0)]
    

    Output:

         A    B  C    D
    1  1.0  1.0  1  1.0
    2  1.0  NaN  1  1.0
    3  NaN  1.0  1  1.0
    

    columns

    Identical logic on the other axis:

    s = df.isna().sum(axis=0)
    
    out = df.loc[:, s.ne(s.max()) | s.eq(0)]
    

    Output:

         A  C    D
    0  1.0  1  NaN
    1  1.0  1  1.0
    2  1.0  1  1.0
    3  NaN  1  1.0
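
For the iterative procedure described in the question's context, the two steps above can be alternated until no NaN is left. The following is only a minimal sketch: the helper names `drop_worst_rows`, `drop_worst_columns` and `drop_until_clean` are hypothetical, not part of pandas, and the loop is a greedy heuristic, so it makes no claim of keeping the largest possible amount of data in every case.

```python
import numpy as np
import pandas as pd


def drop_worst_rows(df):
    """Drop the row(s) tied for the most NaNs; keep everything if there are none."""
    s = df.isna().sum(axis=1)  # NaN count per row
    return df[s.ne(s.max()) | s.eq(0)]


def drop_worst_columns(df):
    """Drop the column(s) tied for the most NaNs; keep everything if there are none."""
    s = df.isna().sum(axis=0)  # NaN count per column
    return df.loc[:, s.ne(s.max()) | s.eq(0)]


def drop_until_clean(df):
    """Alternate the row step and the column step until no NaN remains."""
    out = df
    # While a NaN exists, the row step removes at least one row,
    # so the loop always terminates.
    while out.isna().any().any():
        out = drop_worst_rows(out)
        out = drop_worst_columns(out)
    return out


df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

result = drop_until_clean(df)
print(result)  # rows 1, 2, 3 and columns C, D remain
```

On the example frame, the first pass drops row 0 (two NaNs), then columns A and B (one NaN each in the remaining rows), which already leaves a NaN-free table. The `s.eq(0)` guard matters for the step that finds nothing to remove: without it, a frame with no NaNs would have every row or column tied at the maximum of 0 and all of them would be dropped.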