Search code examples
pythonpandasmissing-data

Best way to count the number of rows with missing values in a pandas DataFrame


I currently came up with some work arounds to count the number of missing values in a pandas DataFrame. Those are quite ugly and I am wondering if there is a better way to do it.

Let's create an example DataFrame:

from numpy.random import randn
df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
               columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

enter image description here

What I currently have is

a) Counting cells with missing values:

>>> sum(df.isnull().values.ravel())
9

b) Counting rows that have missing values somewhere:

>>> sum([True for idx,row in df.iterrows() if any(row.isnull())])
3

Solution

  • For the second count I think just subtract the number of rows from the number of rows returned from dropna:

    In [14]:
    
    from numpy.random import randn
    df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                   columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    df
    Out[14]:
            one       two     three
    a -0.209453 -0.881878  3.146375
    b       NaN       NaN       NaN
    c  0.049383 -0.698410 -0.482013
    d       NaN       NaN       NaN
    e -0.140198 -1.285411  0.547451
    f -0.219877  0.022055 -2.116037
    g       NaN       NaN       NaN
    h -0.224695 -0.025628 -0.703680
    In [18]:
    
    df.shape[0] - df.dropna().shape[0]
    Out[18]:
    3
    

    The first could be achieved using the built in methods:

    In [30]:
    
    df.isnull().values.ravel().sum()
    Out[30]:
    9
    

    Timings

    In [34]:
    
    %timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
    %timeit df.shape[0] - df.dropna().shape[0]
    %timeit sum(map(any, df.apply(pd.isnull)))
    1000 loops, best of 3: 1.55 ms per loop
    1000 loops, best of 3: 1.11 ms per loop
    1000 loops, best of 3: 1.82 ms per loop
    In [33]:
    
    %timeit sum(df.isnull().values.ravel())
    %timeit df.isnull().values.ravel().sum()
    %timeit df.isnull().sum().sum()
    1000 loops, best of 3: 215 µs per loop
    1000 loops, best of 3: 210 µs per loop
    1000 loops, best of 3: 605 µs per loop
    

    So my alternatives are a little faster for a df of this size

    Update

    So for a df with 80,000 rows I get the following:

    In [39]:
    
    %timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
    %timeit df.shape[0] - df.dropna().shape[0]
    %timeit sum(map(any, df.apply(pd.isnull)))
    %timeit np.count_nonzero(df.isnull())
    1 loops, best of 3: 9.33 s per loop
    100 loops, best of 3: 6.61 ms per loop
    100 loops, best of 3: 3.84 ms per loop
    1000 loops, best of 3: 395 µs per loop
    In [40]:
    
    %timeit sum(df.isnull().values.ravel())
    %timeit df.isnull().values.ravel().sum()
    %timeit df.isnull().sum().sum()
    %timeit np.count_nonzero(df.isnull().values.ravel())
    1000 loops, best of 3: 675 µs per loop
    1000 loops, best of 3: 679 µs per loop
    100 loops, best of 3: 6.56 ms per loop
    1000 loops, best of 3: 368 µs per loop
    

    Actually np.count_nonzero wins this hands down.