python, pandas, large-data

Python Pandas improving calculation time for large datasets currently taking ~400 mins to run


I'm trying to improve the performance of a DataFrame I need to build daily, and I wondered if someone had some ideas. I've created a simplified example below:

First, I have a dict of DataFrames like this. This is time series data, so it updates daily.

import pandas as pd
import numpy as np
import datetime as dt
from scipy import stats

dates = [dt.datetime.today().date() - dt.timedelta(days=x) for x in range(2000)]

m_list = [str(i) + 'm' for i in range(0, 15)]
names = [i + j  for i in m_list for j in m_list]

keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
values = [pd.DataFrame([np.random.rand(225) for x in range(0, 2000)], index=dates, columns=names) for i in range(0, 10)]

df_dict = dict(zip(keys, values))    #this is my time series data

Next I have three lists:

#I will build a dict of DataFrames of calculated attributes for these combos, for each df in df_dict above
combos = ['{}/{}'.format(*np.random.choice(names, 2)) for i in range(750)] + ['{}/{}/{}'.format(*np.random.choice(names, 3)) for i in range(1500)]

periods = [20, 60, 100, 200, 500, 1000, 2000]   #num of datapoints to use from time series
benchmarks = np.random.choice(combos, 25)       #benchmarks to compare combos to

And then here is where I build the DataFrames I need:

def calc_beta(a_series, b_series):

    # beta of a vs b: cov(a, b) / var(b)
    covariance = np.cov(a_series, b_series)
    beta = covariance[0, 1] / covariance[1, 1]

    return beta

data_dict = {}

for i in list(df_dict.keys()) :
    
    attr_list = []
    
    df = df_dict[i]
    
    for c in combos :
        
        c_split = c.split('/')
        combo_list = []
        for cs in c_split :
            _list = [int(x) for x in list(filter(None, cs.split('m')))]
            combo_list.append(_list)
        if len(combo_list) == 2 :
            combo_list.append([np.nan, np.nan])
        
        c1a, c1b, c2a, c2b, c3a, c3b = [item for subl in combo_list for item in subl]
        
        if len(c_split) == 2 :
            l1, l2 = c_split
            _series = df[l1] - df[l2]
            
        if len(c_split) == 3 :
            l1, l2, l3 = c_split
            _series = df[l1] - df[l2] - df[l3]
        
        attr = {
            
            'name' : c,
            'a' : c1a,
            'b' : c1b,
            'c' : c2a,
            'd' : c2b,
            'e' : c3a,
            'f' : c3b,
            'series' : _series,
            'last' : _series[-1]
        }
        
        for p in periods :
            _str = str(p)
            p_series = _series[-p:]
            
            attr['quantile' + _str] = stats.percentileofscore(p_series, attr['last'])
            attr['z_score' + _str] = stats.zscore(p_series)[-1]
            attr['std' + _str] = np.std(p_series)            
            attr['range' + _str] = max(p_series) - min(p_series)
            attr['last_range' + _str] = attr['last'] / attr['range' + _str]
            attr['last_std' + _str] = attr['last'] / attr['std' + _str]        
            
            if p > 100 :
                attr['5d_autocorr' + _str] = p_series.autocorr(-5)
            else :
                attr['5d_autocorr' + _str] = np.nan
                
            for b in benchmarks :
                b_split = b.split('/')
                
                if len(b_split) == 1 :
                    b_series = df[b_split[0]]
                    
                elif len(b_split) == 2 :                    
                    b_series = df[b_split[0]] - df[b_split[1]]  
                
                elif len(b_split) == 3 :                    
                    b_series = df[b_split[0]] - df[b_split[1]] - df[b_split[2]]  
                
                b_series = b_series[-p:]
                
                corr_value = p_series.corr(b_series)
                
                beta_value = calc_beta (p_series, b_series)
                
                corr_ticker = '{}_corr{}'.format(b, _str)
                beta_ticker = '{}_beta{}'.format(b, _str)
                
                attr[corr_ticker] = corr_value    
                attr[beta_ticker] = beta_value    
        
                if p > 500 :
                    attr[b + '_20rolling_corr_mean' + _str] = p_series.rolling(20).corr(b_series).mean()

                    df1 = pd.DataFrame({c : p_series, b : b_series})

                    attr[b + '_20d_rolling_beta_mean' + _str] =  df1.rolling(20) \
                                                                    .cov(df1 , pairwise=True) \
                                                                    .drop([c], axis=1) \
                                                                    .unstack(1) \
                                                                    .droplevel(0, axis=1) \
                                                                    .apply(lambda row: row[c] / row[b], axis=1) \
                                                                    .mean()
        
        attr_list.append(attr)
    
    data_dict[i] = pd.DataFrame(attr_list)

This is a generic example rather than the actual data, but it almost exactly replicates what I'm trying to do: every type of calculation is there, although I've reduced the number of them to keep it simpler.

This last part takes about 40 minutes per DataFrame in the dict, i.e. roughly 400 minutes total for this dataset.

I haven't worked with large datasets in the past. From what I understand, I need to minimize for loops and apply functions, which I have, but what else should I be doing? I'd appreciate any input.

Thank you


Solution

  • So, I went to a dark place to figure out some ways to help here :)

    The long and short of it is that the two functions at the end of your script where p > 500 are killing you. When p < 500, I can get some performance gains, which I'll detail later.

    Instead of essentially iterating by combination and filling out your dataframe, I took the approach of starting with a dataframe that has all the combos (in your example above, 2500 rows), then working to the right and vectorizing where I could. I think there is a lot of improvement to be had here, but I couldn't get it to work as well as I'd like, so maybe someone else can help.

    Here's the code I ended up with. It repeats the setup from your question (with the keys and periods cut down, as noted in the comments) and then continues from there.

    import pandas as pd
    import numpy as np
    import datetime as dt
    from scipy import stats
    import time
    
    def calc_beta (a_series, b_series) :
        covariance = np.cov (a_series, b_series)
        beta = covariance[0, 1] / covariance[1, 1]
        return beta
    
    dates = [dt.datetime.today().date() - dt.timedelta(days=x) for x in range(2000)]
    
    m_list = [str(i) + 'm' for i in range(0, 15)]
    names = [i + j  for i in m_list for j in m_list]
    
    #keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    keys = ['A']
    values = [pd.DataFrame([np.random.rand(225) for x in range(0, 2000)], index=dates, columns=names) for i in range(0, 10)]
    
    df_dict = dict(zip(keys, values))
    
    combos = ['{}/{}'.format(*np.random.choice(names, 2)) for i in range(750)] + ['{}/{}/{}'.format(*np.random.choice(names, 3)) for i in range(1500)]
    
    #periods = [20, 60, 100, 200, 500, 1000, 2000]   #num of datapoints to use from time series
    periods = [20]   #num of datapoints to use from time series
    benchmarks = np.random.choice(combos, 25)       #benchmarks to compare combos to
    
    data_dict = {}
    
    for i in list(df_dict.keys()):
        df = df_dict[i]
        mydf =  pd.DataFrame(combos, columns=['name'])
        mydf[['a','b','c','d','e','f']]=mydf.name.str.replace('/', '').str.replace('m', ',').str[0:-1].str.split(',', expand=True)
    
        def get_series(a):
            if len(a) == 2 :
                l1, l2 = a
                s = df[l1] - df[l2]
                return s.tolist()
            else:
                l1, l2, l3 = a
                s = df[l1] - df[l2] - df[l3]
                return s.tolist()
    
        mydf['series'] = mydf['name'].apply(lambda x: get_series(x.split('/')))
        mydf['last'] = mydf['series'].str[-1]
    
        for p in periods:
            _str = str(p)
    
            mydf['quantile' + _str] =  mydf.apply(lambda x: stats.percentileofscore(x['series'][-p:], x['last']), axis=1)
            mydf['z_score' + _str] = mydf.apply(lambda x: stats.zscore(x['series'][-p:])[-1], axis=1)
            mydf['std' + _str] = mydf.apply(lambda x: np.std(x['series'][-p:]), axis=1)
            mydf['range' + _str] = mydf.apply(lambda x: max(x['series'][-p:]) - min(x['series'][-p:]), axis=1)
            mydf['last_range' + _str] = mydf['last'] / mydf['range' + _str]
            mydf['last_std' + _str] = mydf['last'] / mydf['std' + _str]
    
            if p > 100 :
                mydf['5d_autocorr' + _str] = mydf.apply(lambda x: pd.Series(x['series'][-p:]).autocorr(-5), axis=1)
            else :
                mydf['5d_autocorr' + _str] = np.nan
    
            # note: this redefines get_series so it also covers single-name benchmarks
            def get_series(a):
                if len(a) == 1 :
                    b = df[a[0]]
                    return b.tolist()
                elif len(a) == 2 :
                    b = df[a[0]] - df[a[1]]
                    return b.tolist()
                else:
                    b = df[a[0]] - df[a[1]] - df[a[2]]
                    return b.tolist()
    
            for b in benchmarks:
                corr_ticker = '{}_corr{}'.format(b, _str)
                beta_ticker = '{}_beta{}'.format(b, _str)
    
                b_series = get_series(b.split('/'))[-p:]
    
                mydf[corr_ticker] = mydf.apply(lambda x: stats.pearsonr(np.array(x['series'][-p:]), np.array(b_series))[0], axis=1)
                mydf[beta_ticker] = mydf.apply(lambda x: calc_beta(np.array(x['series'][-p:]), np.array(b_series)), axis=1)
    
                if p > 500 :
                    mydf[b + '_20rolling_corr_mean' + _str] = mydf.apply(lambda x: pd.Series(x['series'][-p:]).rolling(20).corr(pd.Series(b_series)).mean(), axis=1)
                    mydf[b + '_20d_rolling_beta_mean' + _str] =  mydf.apply(lambda x: pd.DataFrame({x['name']: pd.Series(x['series'][-p:]), b : pd.Series(b_series)}).rolling(20) \
                                                                    .cov(pd.DataFrame({x['name']: pd.Series(x['series'][-p:]), b : pd.Series(b_series)}) , pairwise=True) \
                                                                    .drop([x['name']], axis=1) \
                                                                    .unstack(1) \
                                                                    .droplevel(0, axis=1) \
                                                                    .apply(lambda row: row[x['name']] / row[b], axis=1) \
                                                                    .mean(), axis=1)
    
        data_dict[i] = mydf
    

    I only ran one key, 'A', and changed the period. Keeping 'A' constant and varying the period, I get the performance gains shown here. At period = 400, I still get 60% better performance.

    A 20
    Original: Total Time 25.74614143371582
    Revised: Total Time 7.026344299316406
    
    A 200
    Original: Total Time 25.56810474395752
    Revised: Total Time 10.015231847763062
    
    A 400
    Original: Total Time 28.221587419509888
    Revised: Total Time 11.064109802246094
    

    Going to period 501, your original code took 1121.6251230239868 seconds. Mine was about the same. Going from 400 to 501 adds an enormous amount of time because of those two functions (repeated over each benchmark).

    If you need those functions and have to calculate them at analysis time, you should spend your time on those two. I found that working through pandas Series is slow, and you'll notice I used the scipy module for correlation in one instance because the gains are worth it. If you can use numpy directly, or the scipy module, for your last two functions, you'll see gains there as well.
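    For instance, here is a minimal sketch of what those two p > 500 metrics could look like with numpy windows instead of the pairwise rolling .cov() chain. It assumes numpy 1.20+ for sliding_window_view, and rolling_corr_mean / rolling_beta_mean are just hypothetical helper names, so sanity-check the output against your current results before relying on it:

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view
    
    def rolling_beta_mean(a, b, window=20):
        # mean of the rolling-window betas of a vs b, computed with plain numpy
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        a_win = sliding_window_view(a, window)          # shape (n - window + 1, window)
        b_win = sliding_window_view(b, window)
        a_dev = a_win - a_win.mean(axis=1, keepdims=True)
        b_dev = b_win - b_win.mean(axis=1, keepdims=True)
        cov_ab = (a_dev * b_dev).mean(axis=1)
        var_b = (b_dev ** 2).mean(axis=1)
        # the ddof normalisation cancels in the cov/var ratio, so the beta matches the rolling-cov version
        return np.mean(cov_ab / var_b)
    
    def rolling_corr_mean(a, b, window=20):
        # mean of the rolling-window correlations of a vs b
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        a_win = sliding_window_view(a, window)
        b_win = sliding_window_view(b, window)
        a_dev = a_win - a_win.mean(axis=1, keepdims=True)
        b_dev = b_win - b_win.mean(axis=1, keepdims=True)
        corr = (a_dev * b_dev).sum(axis=1) / np.sqrt((a_dev ** 2).sum(axis=1) * (b_dev ** 2).sum(axis=1))
        return np.mean(corr)
    
    # hypothetical drop-in for the two p > 500 lines in the original benchmark loop:
    # attr[b + '_20rolling_corr_mean' + _str] = rolling_corr_mean(p_series.values, b_series.values)
    # attr[b + '_20d_rolling_beta_mean' + _str] = rolling_beta_mean(p_series.values, b_series.values)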

    The other place to look is where I'm using lambda functions. This is still row by row, like using for loops. I am saving the period series so I can use it in the calculations that follow:

    def get_series(a):
        if len(a) == 2 :
            l1, l2 = a
            s = df[l1] - df[l2]
            return s.tolist()
        else:
            l1, l2, l3 = a
            s = df[l1] - df[l2] - df[l3]
            return s.tolist()
    
    mydf['series'] = mydf['name'].apply(lambda x: get_series(x.split('/')))
    

    This series is made up of lists that are passed into the lambda functions. I was hoping to find a way to vectorize this by calculating all rows at the same time, but some functions require a Series and some use a list, and I just couldn't figure it out. Here's an example:

    mydf['quantile' + _str] =  mydf.apply(lambda x: stats.percentileofscore(x['series'][-p:], x['last']), axis=1)
    

    If you can figure out how to vectorize this, and then apply it to those functions where p > 500, you'll see some savings.
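    As a rough sketch of that idea (it assumes every row's 'series' list has the same length, which it does here because they all come from the same 2000-row df), you could stack the lists into one 2-D array and fill the whole 'quantile' and 'z_score' columns in one shot. Double-check it against stats.percentileofscore on a few rows before trusting the percentile formula:

    import numpy as np
    
    series_arr = np.array(mydf['series'].tolist())   # shape (n_combos, series length)
    window = series_arr[:, -p:]                      # last p observations per combo
    last = window[:, -1:]                            # shape (n_combos, 1) for broadcasting
    
    # percentile-of-score of the last value within its own window, all rows at once
    # (mirrors the default 'rank' kind: average of the < and <= counts, +1 on an exact match)
    left = (window < last).sum(axis=1)
    right = (window <= last).sum(axis=1)
    mydf['quantile' + _str] = (left + right + (right > left)) * 50.0 / p
    
    # z-score of the last value, all rows at once (ddof=0, matching stats.zscore)
    mu = window.mean(axis=1)
    sd = window.std(axis=1)
    mydf['z_score' + _str] = (window[:, -1] - mu) / sd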

    In the end, with your code or mine, the real issue is those last two functions. Everything else gives smaller, but real, savings that add up, but rethinking that last piece is what will save your day.

    The other option is to either multiprocess or break this up across multiple machines.
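    A bare-bones sketch of the multiprocessing route could look like this, assuming the body of the per-key loop is pulled out into a top-level function (build_attr_frame here is hypothetical) so each key of df_dict is processed in its own worker:

    from multiprocessing import Pool
    
    def process_key(item):
        key, df = item
        # build_attr_frame would hold everything currently inside the
        # `for i in list(df_dict.keys())` loop, taking the DataFrame directly
        return key, build_attr_frame(df)
    
    if __name__ == '__main__':
        with Pool() as pool:
            data_dict = dict(pool.map(process_key, df_dict.items()))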