Search code examples
pythonpandasperformancerunning-count

How to create subindex efficiently?


I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:

      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01

I'd like create a subindex for for my ID and the resulting dataframe looks like below:

               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01

To do this I need to first create my subindex list. I searched in the help document and it seems to most neat way is to use transform:

subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))

However, it is really slow. I looked around and found apply can do the work too:

subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))

Of course the subindex needed to be flattened since it is a list of lists here. This works much faster than the transform method. Then I tested with a for loop of my own:

subindex_size = df.groupby(df.index, sort = False).size()
subindex = []
for i in np.arange(len(subindex_size)):
    subindex.extend(np.arange(1,subindex_size[i]+1))

It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 secs on my computer, apply takes ~2 secs and the for loop takes only ~1 secs. I need to work on much larger dataset so even the time difference between the apply and for loop makes a difference to me. However, the for loop looks ugly and may not be easily applied if I need to create other group-based variables.

So my question is, why the built-in functions that are supposed to do the right thing are slower? Am I missing something here or is there a reason for this? Is there any other way to improve this process?


Solution

  • You can use cumcount to do this:

    In [11]: df.groupby(level=0).cumcount()
    Out[11]: 
    ID
    Bob    0
    Bob    1
    Tom    0
    Tom    1
    Tom    2
    dtype: int64
    
    In [12]: df['subindex'] = df.groupby(level=0).cumcount()  # possibly + 1 here.
    
    In [13]: df.set_index('subindex', append=True)
    Out[13]: 
                   Content        Date
    ID  subindex                      
    Bob 0         birthday  2010.03.01
        1           school  2010.04.01
    Tom 0         shopping  2010.02.01
        1             work  2010.09.01
        2          holiday  2010.10.01
    

    To start at 1 (rather than 0) just add 1 to the result of cumcount.