IndexError: index 0 is out of bounds for axis 0 with size 0 when trying to find the mode (most frequent value)

I concatenated 500 XLSX files into a single DataFrame with shape (672006, 12). Each process has a unique number, and I want to groupby() that number to obtain relevant information. For temperature I would like to select the first value, and for height the most frequent value.

Test data:

import pandas as pd

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3],
'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4],
'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80]})

df_test.groupby('number')['temperature'].first()

df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])

I get the following error when trying to get the most frequent height per number: IndexError: index 0 is out of bounds for axis 0 with size 0

Strangely enough, mean(), first(), max(), etc. all work. The same aggregation also worked on the second part of the dataset when I concatenated it separately.

Can somebody suggest what to do with this error? Thanks!

Solution

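I think the problem is that one or more of your groups contains only NaN heights. value_counts() drops NaN by default, so for such a group it returns an empty Series and .index[0] raises the IndexError.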

See this example, where I added a group with number 4 whose heights are all np.nan.

import numpy as np
import pandas as pd

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3,4,4],
'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4, 5, 5],
'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})

df_test.groupby('number')['temperature'].first()

df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])

Output:

IndexError: index 0 is out of bounds for axis 0 with size 0
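
To check whether this is what happens in the real data, one could list the groups whose heights are all missing. This is only a minimal sketch: df_demo is a made-up frame for illustration, and the column names are assumed to match the test data above.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: group 2 has only NaN heights
df_demo = pd.DataFrame({
    "number": [1, 1, 2, 2],
    "height": [100, 100, np.nan, np.nan],
})

# value_counts() drops NaN by default, so an all-NaN group gives an empty Series
counts = df_demo.loc[df_demo["number"] == 2, "height"].value_counts()
print(len(counts))

# Flag each row as missing or not, then ask per group whether every row is missing
all_nan = df_demo["height"].isna().groupby(df_demo["number"]).all()
all_nan_groups = all_nan[all_nan].index.tolist()
print(all_nan_groups)
```

Any number listed in all_nan_groups is a group for which .index[0] would fail.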

Let's fill those NaN with zero and rerun.

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3,4,4], 
'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4, 5, 5], 
'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})

df_test = df_test.fillna(0) #Add this line
df_test.groupby('number')['temperature'].first()

df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])

Output:

number
1    100.0
2     90.0
3     80.0
4      0.0
Name: height, dtype: float64
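
If replacing missing heights with 0 would distort the data (0 is also a real height in the test data), an alternative is to leave the NaN values in place and guard the lambda instead. The sketch below is one possible way to do that, not the only one; most_frequent and df_demo are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def most_frequent(values: pd.Series):
    """Return the most frequent non-NaN value, or NaN if the group has none."""
    counts = values.value_counts()  # NaN is dropped by default
    return counts.index[0] if len(counts) else np.nan

# Hypothetical demo frame: group 2 has only NaN heights
df_demo = pd.DataFrame({
    "number": [1, 1, 1, 2, 2],
    "height": [100, 100, 0, np.nan, np.nan],
})

result = df_demo.groupby("number")["height"].agg(most_frequent)
print(result)
```

With this guard, groups that have no valid height come back as NaN rather than raising, so they stay distinguishable from groups whose most frequent height really is 0.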