Tags: python, pandas, computation

When is the computational cost of applying a function convex?


It seems that a lot of what I do in Pandas has convex computational-time costs in the amount of data I have (e.g. 1 row takes 1 second, 2 rows take 2.2 seconds, 4 rows take 6 seconds, etc.).

Why don't computational costs increase linearly with the amount of data I have? For example, this function I wrote:

import pandas as pd

def fractrips1brand(trip):
    # Count distinct inside-good items for THIS specific consumer
    art = trip[trip.Item_Id.isin(insidegood)].Item_Id.nunique()
    output = pd.Series({'numinsidegoods': art})
    return output


gr_TILPS = TILPStemp.groupby('uniqueid')
output = gr_TILPS.apply(fractrips1brand)

Seems to exhibit such costs.

Why is it not O(n)?
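
A synthetic benchmark along these lines reproduces that growth; insidegood and the column values below are made-up stand-ins for my real data:

import time
import numpy as np
import pandas as pd

insidegood = ['a', 'b']  # made-up stand-in for the real list of items

for n in [1000, 2000, 4000, 8000]:
    TILPStemp = pd.DataFrame({
        'uniqueid': np.random.randint(0, n // 10, size=n),
        'Item_Id': np.random.choice(['a', 'b', 'c', 'd'], size=n),
    })
    start = time.perf_counter()
    TILPStemp.groupby('uniqueid').apply(fractrips1brand)
    print(n, round(time.perf_counter() - start, 3))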


Solution

  • It's quite common for functions to have greater-than-linear time complexity. Sorting, for example, has O(n log n) complexity.

    gr_TILPS = TILPStemp.groupby('uniqueid')
    

    groupby sorts the keys by default, so this call alone has at least O(n log n) complexity; a rough timing comparison is sketched at the end of this answer. You can turn the sorting off by using

    gr_TILPS = TILPStemp.groupby('uniqueid', sort=False)
    

    In Pandas 0.15 and older versions, Series.nunique calls Series.value_counts, which also sorts the values by default, so this is another call with O(k log k) complexity in the group size k. Since it occurs inside fractrips1brand, i.e. once per group, gr_TILPS.apply(fractrips1brand) costs at least O(m k log k) in total, where m is the number of groups; a hash-based rework that avoids this per-group sort is sketched at the end of this answer.

    Update: In the next release of Pandas (version 0.16.0), Series.nunique should be significantly faster.
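
    To see what the key sort costs on its own, here is a rough comparison on synthetic data (the frame, its size, and the column name are made up for illustration):

    import timeit
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'uniqueid': np.random.randint(0, 10**5, size=10**6)})

    # .size() forces the groups to actually be formed and aggregated
    print(timeit.timeit(lambda: df.groupby('uniqueid').size(), number=10))
    print(timeit.timeit(lambda: df.groupby('uniqueid', sort=False).size(), number=10))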
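
    If you are stuck on a version where nunique sorts, one way to avoid the per-group sort is to count distinct items with a plain Python set, which is hash-based and O(k) per group on average. Below is a sketch reworking the question's function under that idea (same variable names as in the question; note that set() keeps NaN while nunique drops it):

    def fractrips1brand_nosort(trip):
        # Building a set is O(k) on average (hash-based); no sorting involved
        items = trip.Item_Id[trip.Item_Id.isin(insidegood)]
        return pd.Series({'numinsidegoods': len(set(items))})

    output = TILPStemp.groupby('uniqueid', sort=False).apply(fractrips1brand_nosort)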