I have a puzzling situation with pandas groupby objects. I have a dataset with ids, features, and targets for training a machine learning model. In some cases, rows share identical feature values but have differing targets; since that doesn't make sense, I would like to replace the targets within each such group with their mean.
id_cols = list(df.columns[:4])
features = list(df.columns[4:-1])
target = df.columns[-1]
ad_id = id_cols[1]
creative_id = id_cols[-1]
Unfortunately, as I add more feature names (there are around 200) to the groupby operation, the aggregated (mean) dataset changes shape. In my understanding, the resulting dataset of means should have exactly one row per unique group. But past a threshold number of features, the number of entries in the aggregated dataset drops to small numbers:
for n in [10, 20, 30, 35, 40, 50, 100, 200]:
    grpby = df.groupby(features[:n])
    mean_targets = grpby[target].agg(["mean"])
    print(n, len(grpby), mean_targets.shape)
# 10 1349 (1349, 1)
# 20 1882 (1882, 1)
# 30 1978 (1978, 1)
# 35 1978 (31, 1)
# 40 1978 (31, 1)
# 50 1978 (31, 1)
# 100 1978 (19, 1)
# 200 4870 (2, 1)
As you can see, once I group by 35 or more features, the shape of mean_targets no longer matches the number of groups in the groupby object.
What could I be doing wrong, or is this a pandas groupby limitation?
Your problem comes from a large number of NaN values in the feature columns: by default, groupby drops every row whose group keys contain NaN, so those rows silently disappear from the aggregated result.
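A minimal illustration of that default, assuming nothing beyond stock pandas: rows whose group key is NaN simply vanish from the aggregate unless you ask for them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": [1.0, 1.0, np.nan, 2.0],
    "target": [10.0, 20.0, 30.0, 40.0],
})

# Default: the row whose key is NaN is silently dropped before aggregating.
means_default = df.groupby("key")["target"].mean()
print(means_default)  # two groups: 1.0 and 2.0

# dropna=False (pandas >= 1.1) keeps the NaN key as its own group.
means_keep = df.groupby("key", dropna=False)["target"].mean()
print(means_keep)  # three groups: 1.0, 2.0, and NaN
```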
import pandas as pd
import numpy as np

np.random.seed(0)
data = {
    'id1': np.random.randint(1, 100, 10000),
    'ad_id': np.random.randint(1, 50, 10000),
    'id3': np.random.randint(1, 20, 10000),
    'creative_id': np.random.randint(1, 10, 10000),
    'target': np.random.rand(10000)
}
for i in range(1, 201):
    data[f'feature_{i}'] = np.random.randint(0, 5, 10000)
df = pd.DataFrame(data)

id_cols = list(df.columns[:4])
features = list(df.columns[4:-1])
target = df.columns[-1]

def test_groupby(df, features, target):
    for n in [10, 20, 30, 35, 40, 50, 100, 200]:
        try:
            grpby = df.groupby(features[:n])
            mean_targets = grpby[target].agg(["mean"])
            print(n, len(grpby), mean_targets.shape)
        except Exception as e:
            print(f"Error with {n} features: {e}")

test_groupby(df, features, target)
which gives
10 10000 (10000, 1)
20 10000 (10000, 1)
30 10000 (10000, 1)
35 10000 (10000, 1)
40 10000 (10000, 1)
50 10000 (10000, 1)
100 10000 (10000, 1)
200 10000 (10000, 1)
If instead the feature columns contain many NaN values:
import pandas as pd
import numpy as np

np.random.seed(0)
data = {
    'id1': np.random.randint(1, 100, 10000),
    'ad_id': np.random.randint(1, 50, 10000),
    'id3': np.random.randint(1, 20, 10000),
    'creative_id': np.random.randint(1, 10, 10000),
    'target': np.random.rand(10000)
}
for i in range(1, 201):
    col_data = np.random.randint(0, 5, 10000).astype(float)
    col_data[np.random.choice([True, False], size=10000, p=[0.3, 0.7])] = np.nan  # introduce NaN values
    data[f'feature_{i}'] = col_data
df = pd.DataFrame(data)

id_cols = list(df.columns[:4])
features = list(df.columns[4:-1])
target = df.columns[-1]

def test_groupby(df, features, target):
    for n in [10, 20, 30, 35, 40, 50, 100, 200]:
        try:
            grpby = df.groupby(features[:n])
            mean_targets = grpby[target].agg(["mean"])
            print(n, len(grpby), mean_targets.shape)
        except Exception as e:
            print(f"Error with {n} features: {e}")

test_groupby(df, features, target)
the problem is reproduced:
10 10000 (394, 1)
20 10000 (5, 1)
30 10000 (0, 1)
35 10000 (0, 1)
40 10000 (0, 1)
50 10000 (0, 1)
100 10000 (0, 1)
200 10000 (0, 1)
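You can confirm that almost every row is being dropped by counting complete rows across the grouping columns. A small sketch mirroring the 30%-NaN setup above (column names and sizes are illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# 10,000 rows, 30 float columns, each cell NaN with probability 0.3.
df = pd.DataFrame(np.random.randint(0, 5, (10000, 30)).astype(float))
df = df.mask(np.random.rand(10000, 30) < 0.3)

# Expected number of rows with no NaN at all: 10000 * 0.7**30, i.e. well below 1.
complete_rows = df.notna().all(axis=1).sum()
print(complete_rows)
```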
With a 30% chance of NaN per cell, the probability that a row has no NaN across 30 grouping columns is about 0.7^30 ≈ 2e-5, so virtually every row is dropped. To keep those rows, pass dropna=False to groupby:
import pandas as pd
import numpy as np

np.random.seed(0)
data = {
    'id1': np.random.randint(1, 100, 10000),
    'ad_id': np.random.randint(1, 50, 10000),
    'id3': np.random.randint(1, 20, 10000),
    'creative_id': np.random.randint(1, 10, 10000),
    'target': np.random.rand(10000)
}
for i in range(1, 201):
    col_data = np.random.randint(0, 5, 10000).astype(float)
    col_data[np.random.choice([True, False], size=10000, p=[0.3, 0.7])] = np.nan
    data[f'feature_{i}'] = col_data
df = pd.DataFrame(data)

id_cols = list(df.columns[:4])
features = list(df.columns[4:-1])
target = df.columns[-1]

def test_groupby(df, features, target):
    for n in [10, 20, 30, 35, 40, 50, 100, 200]:
        try:
            grpby = df.groupby(features[:n], dropna=False)
            mean_targets = grpby[target].mean()
            print(n, len(grpby), mean_targets.shape)
        except Exception as e:
            print(f"Error with {n} features: {e}")

test_groupby(df, features, target)
which gives
10 10000 (10000,)
20 10000 (10000,)
30 10000 (10000,)
35 10000 (10000,)
40 10000 (10000,)
50 10000 (10000,)
100 10000 (10000,)
200 10000 (10000,)
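An alternative, if you would rather not carry NaN keys into the result, is to fill the feature columns with a sentinel value before grouping. A sketch with a tiny hypothetical frame; -1 is assumed not to collide with any real feature value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f1": [1.0, 1.0, np.nan, np.nan],
    "f2": [np.nan, np.nan, 2.0, 2.0],
    "target": [1.0, 3.0, 10.0, 20.0],
})
features = ["f1", "f2"]

# Replace NaN in the grouping keys with a sentinel so no row is dropped.
filled = df.fillna({c: -1 for c in features})
mean_targets = filled.groupby(features, as_index=False)["target"].mean()
print(mean_targets)
#      f1   f2  target
# 0  -1.0  2.0    15.0
# 1   1.0 -1.0     2.0
```

This also makes the grouped keys ordinary values again, which can be convenient if the deduplicated frame feeds straight into model training.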