Search code examples
pandasdataframegroup-bymissing-data

Missing rate per column per month in pandas Dataframe


Let's take the following pd.DataFrame

>>> df = pd.DataFrame({
    'M'   : ['1', '1' , '3',  '6',  '6',  '6'],
    'col1': [None, 0.1, None, 0.2,  0.3,  0.4],
    'col2': [0.01, 0.1, 1.3,  None, None, 0.5]})

which creates

    M   col1  col2
0   1   NaN   0.01
1   1   0.1   0.10
2   3   NaN   1.30
3   6   0.2   NaN
4   6   0.3   NaN
5   6   0.4   0.50

I would now like to have the missing rate percentage per month per column. The resulting table should look like this

M   col1  col2
1   50.0   0.0
3  100.0   0.0
6    0.0  66.6

where the values in the cells in col1 and col2 state the missing rates per month for the column.

How can I do this?


Solution

  • You can use a groupby.mean on the boolean columns:

    out = (df.drop(columns='M').isna()  # check if the value is missing
             .groupby(df['M'])          # for each M
             .mean().mul(100).round(2)  # get the proportion x 100
             .reset_index()             # index as column
          )
    

    output:

       M   col1   col2
    0  1   50.0   0.00
    1  3  100.0   0.00
    2  6    0.0  66.67