python pandas dataframe row deleting

Deleting rows in pandas data frame after evaluating all columns


I have a very large pandas DataFrame (>100 million rows and thousands of columns). Each row has a unique label as its index, and for most rows only one column contains a value. I want to make a new DataFrame by deleting the rows in which only one column has a value and keeping the rows in which two or more columns have values.


Solution

  • You can drop them using dropna:

    In [3]:
    # sample df
    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'a': [0, np.nan, 2, 3, 4], 'b': [0, np.nan, 2, 3, np.nan], 'c': np.arange(5)})
    df
    
    Out[3]:
        a   b  c
    0   0   0  0
    1 NaN NaN  1
    2   2   2  2
    3   3   3  3
    4   4 NaN  4
    In [5]:
    # keep only the rows that have at least 2 non-NaN values
    df.dropna(thresh=2, axis=0)
    Out[5]:
       a   b  c
    0  0   0  0
    2  2   2  2
    3  3   3  3
    4  4 NaN  4
    

    Passing thresh=2 requires at least 2 non-NA values for a row to be kept, and axis=0 (the default) applies that criterion row-wise. dropna returns a new DataFrame, so assign the result to a variable to keep it.
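
As a cross-check, the same filter can be written with an explicit count of non-NA cells per row. The following is a minimal sketch that reuses the sample frame from the answer; the variable names (non_na_per_row, filtered_mask, filtered_dropna) are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame(
    {
        "a": [0, np.nan, 2, 3, 4],
        "b": [0, np.nan, 2, 3, np.nan],
        "c": np.arange(5),
    }
)

# Count the non-NA cells in each row.
non_na_per_row = df.notna().sum(axis=1)

# Keep rows with at least 2 non-NA values; the result is a new DataFrame.
filtered_mask = df[non_na_per_row >= 2]

# dropna(thresh=2) applies the same criterion in a single call.
filtered_dropna = df.dropna(thresh=2)

# Compare the two results and list the surviving row labels.
same_result = filtered_mask.equals(filtered_dropna)
kept_labels = filtered_dropna.index.tolist()
print(same_result, kept_labels)
```

The explicit mask is handy when the per-row counts are needed for something else as well. On a frame of the size described in the question it also materialises an intermediate boolean frame, so dropna(thresh=2) is probably the lighter choice there.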