It seems that a lot of the stuff I do in Pandas has convex (faster-than-linear) computational-time costs in the amount of data I have (e.g. 1 row takes 1 second, 2 rows take 2.2 seconds, 4 rows take 6 seconds, etc.).
Why do the computational costs not increase linearly with the amount of data I have? For example, this function I wrote:
def fractrips1brand(trip):
    # Count the distinct inside-good items bought by THIS specific consumer
    art = trip[trip.Item_Id.isin(insidegood)].Item_Id.nunique()
    output = pd.Series({'numinsidegoods': art})
    return output

gr_TILPS = TILPStemp.groupby('uniqueid')
output = gr_TILPS.apply(fractrips1brand)
Seems to exhibit such costs.
Why is it not O(n)?
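For concreteness, this is roughly how such timings could be collected (a minimal sketch assuming TILPStemp, insidegood and fractrips1brand as defined above; the row counts are just illustrative):

import time

for n in (1000, 2000, 4000, 8000):
    sample = TILPStemp.head(n)                        # first n transaction rows
    start = time.perf_counter()
    sample.groupby('uniqueid').apply(fractrips1brand)
    print(n, time.perf_counter() - start)             # watch how the time grows as n doubles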
It's quite common for functions to have greater-than-linear time complexity. Sorting, for example, has O(n log n) complexity.
gr_TILPS = TILPStemp.groupby('uniqueid')
groupby sorts the keys by default, so this call alone has at least O(n log n) complexity. You can turn the sorting off with
gr_TILPS = TILPStemp.groupby('uniqueid', sort=False)
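A quick sketch of what sort=False changes (using the names from the question; illustrative only, not a benchmark):

gr_sorted = TILPStemp.groupby('uniqueid')                # keys sorted: extra O(n log n) work
gr_unsorted = TILPStemp.groupby('uniqueid', sort=False)  # keys kept in order of first appearance

# Both yield the same per-group results; only the row order of the output differs,
# so sort afterwards only if you actually need a sorted index.
out_sorted = gr_sorted.apply(fractrips1brand)
out_unsorted = gr_unsorted.apply(fractrips1brand)
out_unsorted.sort_index().equals(out_sorted)             # should be True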
In Pandas 0.15 and older, Series.nunique (source) calls Series.value_counts (source), which also sorts the values by default, so this is another call with O(n log n) complexity. Since it occurs inside fractrips1brand, the total complexity of gr_TILPS.apply(fractrips1brand) is at least O(m n log n), where m is the number of groups and n here denotes the size of a group.
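If the per-group work is the bottleneck, one alternative (a sketch, not part of the original answer; it assumes a pandas version where SeriesGroupBy.nunique is available) is to filter once and compute the unique counts in a single grouped operation, avoiding the Python-level apply entirely:

mask = TILPStemp.Item_Id.isin(insidegood)      # keep only inside-good rows
output = (TILPStemp[mask]
          .groupby('uniqueid', sort=False)['Item_Id']
          .nunique()
          .rename('numinsidegoods'))

Note that consumers with no inside-good rows simply drop out of this result, whereas the apply version reports 0 for them.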
Update: In the next release of Pandas (version 0.16.0), Series.nunique should be significantly faster.