Search code examples
pythonpandasgroup-by

Calculate Null Values


I have a table and want to know the null counts but I get an unexpected error.

The table is below:

outage_id   isp_name    start_time  end_time    affected_area   reported_issues
1   ISP D   2023-01-01 00:00:00 2023-01-01 01:00:00 Area 6  39
2   ISP E   2023-01-01 01:00:00 2023-01-01 05:00:00 Area 3  47
3   ISP C   2023-01-01 02:00:00 2023-01-01 03:00:00 Area 4  60
4   ISP E   2023-01-01 03:00:00 2023-01-01 04:00:00 Area 4  63
5   ISP E   2023-01-01 04:00:00 NULL    Area 3  17
6   ISP B   2023-01-01 05:00:00 2023-01-01 06:00:00 Area 3  36
7   ISP C   2023-01-01 06:00:00 NULL    Area 3  36
8   ISP C   2023-01-01 07:00:00 NULL    Area 4  30
9   ISP C   2023-01-01 08:00:00 2023-01-01 09:00:00 Area 7  39
10  ISP E   2023-01-01 09:00:00 2023-01-01 12:00:00 Area 4  37
11  ISP D   2023-01-01 10:00:00 2023-01-01 11:00:00 Area 9  73
12  ISP C   2023-01-01 11:00:00 2023-01-01 13:00:00 Area 1  78
13  ISP E   2023-01-01 12:00:00 NULL    Area 8  70
14  ISP B   2023-01-01 13:00:00 2023-01-01 17:00:00 Area 7  57
15  ISP D   2023-01-01 14:00:00 2023-01-01 15:00:00 Area 2  28
16  ISP B   2023-01-01 15:00:00 NULL    Area 8  13
17  ISP D   2023-01-01 16:00:00 2023-01-01 20:00:00 Area 1  44
18  ISP E   2023-01-01 17:00:00 2023-01-01 21:00:00 Area 9  73
19  ISP A   2023-01-01 18:00:00 2023-01-01 21:00:00 Area 9  58
20  ISP D   2023-01-01 19:00:00 2023-01-01 22:00:00 Area 2  26
21  ISP B   2023-01-01 20:00:00 NULL    Area 7  53
22  ISP E   2023-01-01 21:00:00 2023-01-01 22:00:00 Area 3  39
23  ISP D   2023-01-01 22:00:00 2023-01-02 01:00:00 Area 7  55
24  ISP A   2023-01-01 23:00:00 2023-01-02 02:00:00 Area 9  15
25  ISP A   2023-01-02 00:00:00 NULL    Area 4  46
26  ISP C   2023-01-02 01:00:00 2023-01-02 03:00:00 Area 1  33
27  ISP C   2023-01-02 02:00:00 NULL    Area 2  55
28  ISP B   2023-01-02 03:00:00 NULL    Area 1  62
29  ISP D   2023-01-02 04:00:00 NULL    Area 5  69
30  ISP D   2023-01-02 05:00:00 2023-01-02 09:00:00 Area 5  72
31  ISP C   2023-01-02 06:00:00 2023-01-02 10:00:00 Area 7  94
32  ISP D   2023-01-02 07:00:00 2023-01-02 10:00:00 Area 9  41
33  ISP D   2023-01-02 08:00:00 2023-01-02 11:00:00 Area 9  96
34  ISP A   2023-01-02 09:00:00 2023-01-02 13:00:00 Area 3  42
35  ISP C   2023-01-02 10:00:00 2023-01-02 13:00:00 Area 3  76
36  ISP E   2023-01-02 11:00:00 2023-01-02 13:00:00 Area 3  27
37  ISP C   2023-01-02 12:00:00 2023-01-02 14:00:00 Area 4  34
38  ISP E   2023-01-02 13:00:00 2023-01-02 14:00:00 Area 8  63
39  ISP A   2023-01-02 14:00:00 NULL    Area 6  67
40  ISP B   2023-01-02 15:00:00 2023-01-02 19:00:00 Area 8  76
41  ISP D   2023-01-02 16:00:00 NULL    Area 1  55

My codes are below,

``
df.end_time.isnull().groupby('isp_name').sum() 
#or
df.end_time.isnull().groupby('isp_name').transform('sum') 
``

It shows KeyError: 'isp_name'.

I do have the 'isp_name' column and I don't know why have this issue.

Thanks


Solution

  • After you sliced with df['end_time'].isnull() you only have a Series and the isp_name column is no longer available.

    You need to groupby using a Series:

    df['end_time'].isnull().groupby(df['isp_name']).sum() 
    

    Or use a custom aggregation function:

    df.groupby('isp_name')['end_time'].agg(lambda x: x.isnull().sum())
    

    Output:

    isp_name
    ISP A    2
    ISP B    6
    ISP C    3
    ISP D    4
    ISP E    3
    Name: end_time, dtype: int64
    

    Last option, for most flexibility, assign a new column:

    (df.assign(non_null=lambda d: d['end_time'].isnull())
       .groupby('isp_name')['non_null'].sum()
    )