python, pandas, group-by

Pandas aggregated groupby has incorrect size


I have a puzzling situation with pandas groupby objects. I have a dataset with ids, features, and targets for training a machine learning model. In some cases, rows with identical feature values have differing target values, and since that doesn't make sense, I would like to compute the mean target value within each such group.
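
To make the intent concrete, here is a toy illustration of what I mean (these column names are made up, not the real ones):

import pandas as pd

toy = pd.DataFrame({
    "f1": [1, 1, 2],
    "f2": [0, 0, 3],
    "target": [0.2, 0.4, 0.9],
})
# rows with identical (f1, f2) collapse into one row holding the mean target
print(toy.groupby(["f1", "f2"], as_index=False)["target"].mean())

In my actual data the columns are set up like this: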

id_cols = list(df.columns[:4])
features = list(df.columns[4:-1])
target = df.columns[-1]
ad_id = id_cols[1]
creative_id = id_cols[-1]

Unfortunately, as I add more feature names (there are around 200 of them) to the groupby operation, the aggregated (mean) dataset changes shape. In my understanding, the length of the resulting dataset of means should be exactly the number of unique groups. But beyond a certain number of features, the number of entries in the aggregated result drops to very small numbers:

for n in [10, 20, 30, 35, 40, 50, 100, 200]:
    grpby = df.groupby(features[:n])
    mean_targets = grpby[target].agg(["mean"])
    print(n, len(grpby), mean_targets.shape)
# 10 1349 (1349, 1)
# 20 1882 (1882, 1)
# 30 1978 (1978, 1)
# 35 1978 (31, 1)
# 40 1978 (31, 1)
# 50 1978 (31, 1)
# 100 1978 (19, 1)
# 200 4870 (2, 1)

As you can see, once I use 35 or more features, the shape of mean_targets no longer matches the number of groups reported by the groupby object.

What could I be doing wrong - or could this be a pandas groupby limitation?


Solution

  • Your problem comes from a large number of NaN values in the feature columns. By default, df.groupby(...) runs with dropna=True, so any row that has a NaN in any of the grouping columns is silently excluded from the aggregated result; the more feature columns you group on, the more rows carry at least one NaN key and get dropped.
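
    Here is a minimal sketch of that mechanism on toy data (the frame and column names are illustrative, not your dataset):

    import pandas as pd
    import numpy as np
    
    toy = pd.DataFrame({
        "f1": [1, 1, np.nan, 2],
        "f2": [np.nan, 1, 1, 2],
        "target": [0.1, 0.2, 0.3, 0.4],
    })
    # default dropna=True: rows 0 and 2 carry a NaN key and are dropped -> 2 groups
    print(toy.groupby(["f1", "f2"])["target"].mean())
    # dropna=False keeps the NaN keys -> 4 groups
    print(toy.groupby(["f1", "f2"], dropna=False)["target"].mean())

    The same effect at scale is shown below. First, a reproduction with no NaN in the feature columns, where the group count always matches the length of the aggregated result: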

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    data = {
        'id1': np.random.randint(1, 100, 10000),
        'ad_id': np.random.randint(1, 50, 10000),
        'id3': np.random.randint(1, 20, 10000),
        'creative_id': np.random.randint(1, 10, 10000),
        'target': np.random.rand(10000)
    }
    
    for i in range(1, 201):
        data[f'feature_{i}'] = np.random.randint(0, 5, 10000)
    
    df = pd.DataFrame(data)
    
    id_cols = list(df.columns[:4])
    features = list(df.columns[4:-1])
    target = df.columns[-1]
    
    def test_groupby(df, features, target):
        for n in [10, 20, 30, 35, 40, 50, 100, 200]:
            try:
                grpby = df.groupby(features[:n])
                mean_targets = grpby[target].agg(["mean"])
                print(n, len(grpby), mean_targets.shape)
            except Exception as e:
                print(f"Error with {n} features: {e}")
    
    test_groupby(df, features, target)
    

    which gives

    10 10000 (10000, 1)
    20 10000 (10000, 1)
    30 10000 (10000, 1)
    35 10000 (10000, 1)
    40 10000 (10000, 1)
    50 10000 (10000, 1)
    100 10000 (10000, 1)
    200 10000 (10000, 1)
    

    Whereas if the feature columns contain lots of NaN values:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    data = {
        'id1': np.random.randint(1, 100, 10000),
        'ad_id': np.random.randint(1, 50, 10000),
        'id3': np.random.randint(1, 20, 10000),
        'creative_id': np.random.randint(1, 10, 10000),
        'target': np.random.rand(10000)
    }
    
    for i in range(1, 201):
        col_data = np.random.randint(0, 5, 10000).astype(float)
        col_data[np.random.choice([True, False], size=10000, p=[0.3, 0.7])] = np.nan  # set ~30% of the values to NaN
        data[f'feature_{i}'] = col_data
    
    df = pd.DataFrame(data)
    
    id_cols = list(df.columns[:4])
    features = list(df.columns[4:-1])
    target = df.columns[-1]
    
    def test_groupby(df, features, target):
        for n in [10, 20, 30, 35, 40, 50, 100, 200]:
            try:
                grpby = df.groupby(features[:n])
                mean_targets = grpby[target].agg(["mean"])
                print(n, len(grpby), mean_targets.shape)
            except Exception as e:
                print(f"Error with {n} features: {e}")
    
    test_groupby(df, features, target)
    
    

    which reproduces your problem

    10 10000 (394, 1)
    20 10000 (5, 1)
    30 10000 (0, 1)
    35 10000 (0, 1)
    40 10000 (0, 1)
    50 10000 (0, 1)
    100 10000 (0, 1)
    200 10000 (0, 1)
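
    To check whether this is what is happening in your data, you can count how many rows keep a complete set of grouping keys, since a single NaN among the keys removes the whole row from the result (a small diagnostic sketch, reusing df and features from your question):

    # rows with no NaN in any of the first n grouping columns
    for n in [10, 20, 30, 35, 40, 50, 100, 200]:
        complete = df[features[:n]].notna().all(axis=1).sum()
        print(n, "rows usable as grouping keys:", complete)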
    

    To counter the effect of the NaN values, you can pass dropna=False to groupby:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    data = {
        'id1': np.random.randint(1, 100, 10000),
        'ad_id': np.random.randint(1, 50, 10000),
        'id3': np.random.randint(1, 20, 10000),
        'creative_id': np.random.randint(1, 10, 10000),
        'target': np.random.rand(10000)
    }
    
    for i in range(1, 201):
        col_data = np.random.randint(0, 5, 10000).astype(float)
        col_data[np.random.choice([True, False], size=10000, p=[0.3, 0.7])] = np.nan  # set ~30% of the values to NaN
        data[f'feature_{i}'] = col_data
    
    df = pd.DataFrame(data)
    
    id_cols = list(df.columns[:4])
    features = list(df.columns[4:-1])
    target = df.columns[-1]
    
    def test_groupby(df, features, target):
        for n in [10, 20, 30, 35, 40, 50, 100, 200]:
            try:
                grpby = df.groupby(features[:n], dropna=False)  # keep groups whose keys contain NaN
                mean_targets = grpby[target].mean()  # .mean() on a single column returns a Series
                print(n, len(grpby), mean_targets.shape)
            except Exception as e:
                print(f"Error with {n} features: {e}")
    
    test_groupby(df, features, target)
    

    which gives

    10 10000 (10000,)
    20 10000 (10000,)
    30 10000 (10000,)
    35 10000 (10000,)
    40 10000 (10000,)
    50 10000 (10000,)
    100 10000 (10000,)
    200 10000 (10000,)
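
    A side note on the shapes: the (10000,) here versus the (10000, 1) earlier is unrelated to the NaN issue. Selecting one column and calling .mean() returns a Series, while .agg(["mean"]) returns a one-column DataFrame; the number of groups is the same either way. A toy illustration:

    import pandas as pd
    
    toy = pd.DataFrame({"k": [1, 1, 2], "target": [0.1, 0.2, 0.3]})
    print(toy.groupby("k")["target"].mean().shape)         # (2,)   -> Series
    print(toy.groupby("k")["target"].agg(["mean"]).shape)  # (2, 1) -> DataFrame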