Currently I am using this statement to find all columns in a dataframe that have no missing values, and it works fine. But I'm wondering if there is a more concise (and ideally more efficient) way to do the same thing?
df.columns[ np.sum(df.isnull()) == 0 ]
To answer the question properly one would need access to the dataframe in question.
Without it, there are various methods one can use.
Let's consider the following dataframe as an example
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df.iloc[0:10, 0] = np.nan
[Out]:
A B C D
0 NaN 89 63 41
1 NaN 12 47 8
2 NaN 79 76 67
3 NaN 87 61 38
4 NaN 28 31 30
Method 1 - As OP indicated (we will use it as the reference)
df.columns[ np.sum(df.isnull()) == 0 ]
Method 2 - Similar to Method 1, with numpy.sum and pandas.isnull, but using a lambda function
df.columns[ df.apply(lambda x: np.sum(x.isnull()) == 0) ]
Method 3 - Using numpy.all and pandas.DataFrame.notnull
columns = df.columns[ np.all(df.notnull(), axis=0) ]
Method 4 - Using only pandas built-in modules
columns = df.columns[ df.isnull().sum() == 0 ]
Method 5 - Using pandas.DataFrame.isna
columns = df.columns[ df.isna().any() == False ]
All methods produce the output that OP wants, more specifically
Index(['B', 'C', 'D'], dtype='object')
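As a quick sanity check, the five expressions can be run side by side on a reproducible version of the example dataframe (a minimal sketch; the seed is an assumption, and the axis=0 argument in method 1 is spelled out explicitly, which is equivalent to the original):

```python
import numpy as np
import pandas as pd

# Reproducible example (the seed is an assumption; any seed behaves the same here)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD')).astype(float)
df.iloc[0:10, 0] = np.nan  # only column 'A' gets missing values

results = [
    df.columns[np.sum(df.isnull(), axis=0) == 0],             # method 1
    df.columns[df.apply(lambda x: np.sum(x.isnull()) == 0)],  # method 2
    df.columns[np.all(df.notnull(), axis=0)],                 # method 3
    df.columns[df.isnull().sum() == 0],                       # method 4
    df.columns[df.isna().any() == False],                     # method 5
]

# Every variant returns the same Index of fully non-null columns
assert all(r.equals(results[0]) for r in results)
print(results[0])  # Index(['B', 'C', 'D'], dtype='object')
```

Since NaN is inserted only into column A, the result is the same regardless of the random values drawn.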
If one times each of the methods with time.perf_counter()
(there are other ways to measure execution time, such as the timeit module), one gets the following
method time
0 method 1 2.999996e-07
1 method 2 3.000005e-07
2 method 3 2.000006e-07
3 method 4 6.000000e-07
4 method 5 3.999994e-07
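For reference, the table above could have been produced with a harness along these lines (a sketch, not the exact script used; single-run perf_counter measurements are noisy, so exact numbers will differ between runs and machines):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD')).astype(float)
df.iloc[0:10, 0] = np.nan

# Candidate expressions, wrapped so each can be timed uniformly
methods = {
    'method 1': lambda: df.columns[np.sum(df.isnull(), axis=0) == 0],
    'method 2': lambda: df.columns[df.apply(lambda x: np.sum(x.isnull()) == 0)],
    'method 3': lambda: df.columns[np.all(df.notnull(), axis=0)],
    'method 4': lambda: df.columns[df.isnull().sum() == 0],
    'method 5': lambda: df.columns[df.isna().any() == False],
}

rows = []
for name, run in methods.items():
    start = time.perf_counter()
    run()
    rows.append({'method': name, 'time': time.perf_counter() - start})

print(pd.DataFrame(rows))
```

For more reliable comparisons, each expression should be repeated many times and the best of several runs taken, which is what the timeit module does by default.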
Again, the timings might change depending on the dataframe that one uses. Also, depending on the requirements (hardware and business requirements), there might be other ways to achieve the same goal.