I would like to count the missing variables in a dataframe. I have many variables that I wish to count, and many variables that I wish to group by. I know how to do this for one variable in the group, but not several:
import pandas as pd
df=pd.DataFrame({'type':['A', 'A', 'B'], 'var1':[1, None, 2],'var2':[None, None, 3]})
vars=['var1', 'var2']
df.drop('type', 1)[vars].isna().groupby(df.type, sort=False).sum().reset_index()
Attempts:
df=pd.DataFrame({'type':['A', 'A', 'B'], 'type2':['Z', 'Y', 'Y'], 'var1':[1, None, 2],
'var2':[None, None, 3]})
df.drop(['type', 'type2'], 1).isna().groupby(df[['type', 'type2']], sort=False).sum().reset_index()
grouping_vars = ['type', 'type2']
df.drop(grouping_vars, 1).isna().groupby(df[grouping_vars], sort=False).sum().reset_index()
Both attempts above give me
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional.
I'm guessing I'm not understanding something fundamental about groupby
here, so any help is much appreciated!
You can not use dataframe as the grouper since it is two dimensional object. Groupby requires one dimensional arrays/series as grouper. So in order to fix the error you can create a list of one dimensional series(grp_cols
) and then use it as a grouper to count the missing values
keys = ['type', 'type2']
grp_cols = [df[k] for k in keys]
df.drop(keys, axis=1).isna().groupby(grp_cols, sort=False).sum().reset_index()
Alternatively, there is a simpler approach where you have to first set the grouping columns as index then use isna
followed by groupby
+ sum
to count the missing values
keys = ['type', 'type2']
df.set_index(keys).isna().groupby(keys, sort=False).sum().reset_index()
type type2 var1 var2
0 A Z 0 1
1 A Y 1 1
2 B Y 0 0