python pandas dataframe numpy missing-data

Faster way to find all columns with no missing values?


I am currently using this statement to find all columns in a dataframe that have no missing values. It works fine, but I'm wondering if there is a more concise (and still efficient) way to do the same thing?

df.columns[ np.sum(df.isnull()) == 0 ]

Solution

  • A precise answer would require access to the dataframe in question.

    Without it, there are various methods one can use.

    Let's consider the following dataframe as an example

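    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    # Cast A to float first so it can hold NaN without an implicit upcast
    df['A'] = df['A'].astype(float)
    df.iloc[0:10, 0] = np.nan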
    
    [Out]:
        A   B   C   D
    0 NaN  89  63  41
    1 NaN  12  47   8
    2 NaN  79  76  67
    3 NaN  87  61  38
    4 NaN  28  31  30
    
    1. Method 1 - As OP indicated (used here as the reference)

      df.columns[ np.sum(df.isnull()) == 0 ]
      
    2. Method 2 - Similar to Method 1, with numpy.sum and pandas.isnull, but applied column by column with a lambda function

      df.columns[ df.apply(lambda x: np.sum(x.isnull()) == 0) ]
      
    3. Method 3 - Using numpy.all and pandas.DataFrame.notnull

      columns = df.columns[ np.all(df.notnull(), axis=0) ]
      
    4. Method 4 - Using only pandas built-in methods

      columns = df.columns[ df.isnull().sum() == 0 ]
      
    5. Method 5 - Using pandas.DataFrame.isna (same method used here).

      columns = df.columns[ ~df.isna().any() ]
      

    All five methods produce the output OP wants, namely

    Index(['B', 'C', 'D'], dtype='object')
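
As a quick sanity check, the five methods can be run side by side on the same frame. This is only a sketch: the fixed seed, the `candidates` dictionary and the explicit `axis=0` in Method 1 are assumptions added here for reproducibility and clarity, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Example frame as in the answer; the fixed seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
df["A"] = df["A"].astype(float)  # float column so it can hold NaN
df.iloc[0:10, 0] = np.nan        # only column A has missing values

# The five approaches, keyed by the labels used above.
# axis=0 is passed explicitly to np.sum so the reduction is clearly column-wise.
candidates = {
    "method 1": lambda d: d.columns[np.sum(d.isnull(), axis=0) == 0],
    "method 2": lambda d: d.columns[d.apply(lambda x: np.sum(x.isnull()) == 0)],
    "method 3": lambda d: d.columns[np.all(d.notnull(), axis=0)],
    "method 4": lambda d: d.columns[d.isnull().sum() == 0],
    "method 5": lambda d: d.columns[~d.isna().any()],  # same mask as comparing with False
}

# Collect the selected column names per method and print them.
results = {name: list(select(df)) for name, select in candidates.items()}
for name, cols in results.items():
    print(name, cols)
```

Each printed line should list B, C and D, since only column A was given missing values.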
    

    Timing each of the methods with time.perf_counter() (there are other ways to measure execution time) gives the following

         method          time
    0  method 1  2.999996e-07
    1  method 2  3.000005e-07
    2  method 3  2.000006e-07
    3  method 4  6.000000e-07
    4  method 5  3.999994e-07
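
A single perf_counter() measurement of a call this short is noisy, so one of the other ways to time the methods is the standard-library timeit module, which repeats each call many times. The sketch below is an assumption about how such a comparison could be set up (the repeat counts, the seed and the choice of three methods are arbitrary), and it is not how the numbers above were obtained.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of example frame as above; the seed is an assumption.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
df["A"] = df["A"].astype(float)
df.iloc[0:10, 0] = np.nan

# A subset of the methods, wrapped as zero-argument callables for timeit.
methods = {
    "method 3": lambda: df.columns[np.all(df.notnull(), axis=0)],
    "method 4": lambda: df.columns[df.isnull().sum() == 0],
    "method 5": lambda: df.columns[~df.isna().any()],
}

N_CALLS = 200  # calls per measurement (arbitrary choice)
timings = {}
for name, fn in methods.items():
    # Five measurements of N_CALLS calls each; keep the fastest to reduce noise.
    runs = timeit.repeat(fn, number=N_CALLS, repeat=5)
    timings[name] = min(runs) / N_CALLS  # seconds per call

print(pd.Series(timings, name="seconds per call").sort_values())
```

The absolute values depend on the machine and the pandas version, so only the relative ordering on one's own data is meaningful.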
    


    Again, these timings might change depending on the dataframe one uses. Also, depending on hardware and business requirements, there might be other ways to achieve the same goal.
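
Two such alternatives, which are not among the timed methods above, are sketched here on a small hypothetical frame: a boolean mask built with notna().all(), and dropna along the column axis. Which one is faster was not measured here and would need to be checked on the actual data.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame: only column A contains a missing value.
df = pd.DataFrame({"A": [np.nan, 1.0, 2.0], "B": [1, 2, 3], "C": [4, 5, 6]})

# Mask approach: True for every column whose values are all non-missing.
complete_by_mask = df.columns[df.notna().all()]

# dropna approach: builds a new frame without the columns that contain NaN,
# so it may do more work than the mask when only the names are needed.
complete_by_dropna = df.dropna(axis="columns").columns

print(list(complete_by_mask))    # ['B', 'C']
print(list(complete_by_dropna))  # ['B', 'C']
```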