I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:
Content Date
ID
Bob birthday 2010.03.01
Bob school 2010.04.01
Tom shopping 2010.02.01
Tom work 2010.09.01
Tom holiday 2010.10.01
I'd like create a subindex for for my ID
and the resulting dataframe looks like below:
Content Date
ID subindex
Bob 1 birthday 2010.03.01
2 school 2010.04.01
Tom 1 shopping 2010.02.01
2 work 2010.09.01
3 holiday 2010.10.01
To do this I need to first create my subindex
list. I searched in the help document and it seems to most neat way is to use transform
:
subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))
However, it is really slow. I looked around and found apply
can do the work too:
subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))
Of course the subindex
needed to be flattened since it is a list of lists here. This works much faster than the transform
method. Then I tested with a for loop
of my own:
subindex_size = df.groupby(df.index, sort = False).size()
subindex = []
for i in np.arange(len(subindex_size)):
subindex.extend(np.arange(1,subindex_size[i]+1))
It's even faster. With my larger dataset (about 90k rows), the transform
method takes about 44 secs on my computer, apply
takes ~2 secs and the for loop
takes only ~1 secs. I need to work on much larger dataset so even the time difference between the apply
and for loop
makes a difference to me. However, the for loop
looks ugly and may not be easily applied if I need to create other group-based variables.
So my question is, why the built-in functions that are supposed to do the right thing are slower? Am I missing something here or is there a reason for this? Is there any other way to improve this process?
You can use cumcount to do this:
In [11]: df.groupby(level=0).cumcount()
Out[11]:
ID
Bob 0
Bob 1
Tom 0
Tom 1
Tom 2
dtype: int64
In [12]: df['subindex'] = df.groupby(level=0).cumcount() # possibly + 1 here.
In [13]: df.set_index('subindex', append=True)
Out[13]:
Content Date
ID subindex
Bob 0 birthday 2010.03.01
1 school 2010.04.01
Tom 0 shopping 2010.02.01
1 work 2010.09.01
2 holiday 2010.10.01
To start at 1 (rather than 0) just add 1 to the result of cumcount.