Search code examples
pythonpandasdataframedelete-row

python delete row where most columns are nans


I'm importing data where from excel where some rows may have notes in a column and are not truly part of the dataframe. dummy Eg. below:

H1            H2           H3
*highlighted cols are PII
sam          red           5
pam          blue          3
rod          green         11
 * this is the end of the data

When the above file is imported into dfPA it looks like:

dfPA:
Index     H1            H2           H3
1         *highlighted cols are PII
2         sam          red           5
3         pam          blue          3
4         rod          green         11
5         * this is the end of the data

I want to delete the first and last row. This is what I've done.

#get count of cols in df
input: cntcols = dfPA.shape[1]
output: 3

#get count of cols with nan in df
input: a = dfPA.shape[1] - dfPA.count(axis=1)
output: 
0     2
1     3
2     3
4     3
5     2
(where a is a series)

#convert a from series to df
dfa = a.to_frame()

#delete rows where no. of nan's are greater than 'n'
n = 1
for r, row in dfa.iterrows():
    if (cntcols - dfa.iloc[r][0]) > n:
        i = row.name
        dfPA = dfPA.drop(index=i)

This doesn't work. Is there way to do this?


Solution

  • You should use the pandas.DataFrame.dropna method. It has a thresh parameter that you can use to define a minimum number of NaN to drop the row/column.

    Imagine the following dataframe:

    >>> import numpy as np
    >>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
    
         A    B  C    D
    0  1.0  NaN  1  NaN
    1  1.0  1.0  1  1.0
    2  1.0  NaN  1  1.0
    3  NaN  1.0  1  1.0
    

    You can drop columns with NaN using:

    >>> df.dropna(axis=1)
    
       C
    0  1
    1  1
    2  1
    3  1
    

    The thresh parameter defines the minimum number of non-NaN values to keep the column:

    >>> df.dropna(thresh=3, axis=1)
    
         A  C    D
    0  1.0  1  NaN
    1  1.0  1  1.0
    2  1.0  1  1.0
    3  NaN  1  1.0
    

    If you want to reason in terms of the number of NaN:

    # example for a minimum of 2 NaN to drop the column
    >>> df.dropna(thresh=len(df.columns)-(2-1), axis=1)
    

    If the rows rather than the columns need to be filtered, remove the axis parameter or use axis=0:

    >>> df.dropna(thresh=3)