I'm trying to improve the performance of a DataFrame build that I need to run daily, and I was wondering if someone had some ideas. I've created a simplified example below.
First, I have a dict of DataFrames like this. It holds time series data, so it updates daily.
import pandas as pd
import numpy as np
import datetime as dt
from scipy import stats
dates = [dt.datetime.today().date() - dt.timedelta(days=x) for x in range(2000)]
m_list = [str(i) + 'm' for i in range(0, 15)]
names = [i + j for i in m_list for j in m_list]
keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
values = [pd.DataFrame([np.random.rand(225) for x in range(0, 2000)], index=dates, columns=names) for i in range(0, 10)]
df_dict = dict(zip(keys, values)) #this is my time series data
Next I have three lists:
#I will build a dict of DataFrames of calculated attributes for these combos, one for each df in df_dict above
combos = ['{}/{}'.format(*np.random.choice(names, 2)) for i in range(750)] + ['{}/{}/{}'.format(*np.random.choice(names, 3)) for i in range(1500)]
periods = [20, 60, 100, 200, 500, 1000, 2000] #num of datapoints to use from time series
benchmarks = np.random.choice(combos, 25) #benchmarks to compare combos to
And then here is where I build the DataFrames I need:
def calc_beta(a_series, b_series):
    covariance = np.cov(a_series, b_series)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta

data_dict = {}
for i in list(df_dict.keys()):
    attr_list = []
    df = df_dict[i]
    for c in combos:
        c_split = c.split('/')
        combo_list = []
        for cs in c_split:
            _list = [int(x) for x in list(filter(None, cs.split('m')))]
            combo_list.append(_list)
        if len(combo_list) == 2:
            combo_list.append([np.nan, np.nan])
        c1a, c1b, c2a, c2b, c3a, c3b = [item for subl in combo_list for item in subl]
        if len(c_split) == 2:
            l1, l2 = c_split
            _series = df[l1] - df[l2]
        if len(c_split) == 3:
            l1, l2, l3 = c_split
            _series = df[l1] - df[l2] - df[l3]
        attr = {
            'name': c,
            'a': c1a,
            'b': c1b,
            'c': c2a,
            'd': c2b,
            'e': c3a,
            'f': c3b,
            'series': _series,
            'last': _series.iloc[-1]
        }
        for p in periods:
            _str = str(p)
            p_series = _series[-p:]
            attr['quantile' + _str] = stats.percentileofscore(p_series, attr['last'])
            attr['z_score' + _str] = stats.zscore(p_series)[-1]
            attr['std' + _str] = np.std(p_series)
            attr['range' + _str] = max(p_series) - min(p_series)
            attr['last_range' + _str] = attr['last'] / attr['range' + _str]
            attr['last_std' + _str] = attr['last'] / attr['std' + _str]
            if p > 100:
                attr['5d_autocorr' + _str] = p_series.autocorr(-5)
            else:
                attr['5d_autocorr' + _str] = np.nan
            for b in benchmarks:
                # benchmark series can have 1, 2 or 3 legs
                b_split = b.split('/')
                if len(b_split) == 1:
                    b_series = df[b_split[0]]
                elif len(b_split) == 2:
                    b_series = df[b_split[0]] - df[b_split[1]]
                elif len(b_split) == 3:
                    b_series = df[b_split[0]] - df[b_split[1]] - df[b_split[2]]
                b_series = b_series[-p:]
                corr_value = p_series.corr(b_series)
                beta_value = calc_beta(p_series, b_series)
                corr_ticker = '{}_corr{}'.format(b, _str)
                beta_ticker = '{}_beta{}'.format(b, _str)
                attr[corr_ticker] = corr_value
                attr[beta_ticker] = beta_value
                if p > 500:
                    attr[b + '_20rolling_corr_mean' + _str] = p_series.rolling(20).corr(b_series).mean()
                    # rolling 20-day beta: pairwise rolling covariance divided by benchmark variance
                    df1 = pd.DataFrame({c: p_series, b: b_series})
                    attr[b + '_20d_rolling_beta_mean' + _str] = df1.rolling(20) \
                        .cov(df1, pairwise=True) \
                        .drop([c], axis=1) \
                        .unstack(1) \
                        .droplevel(0, axis=1) \
                        .apply(lambda row: row[c] / row[b], axis=1) \
                        .mean()
        attr_list.append(attr)
    data_dict[i] = pd.DataFrame(attr_list)
This is a generic example rather than the actual data, but it replicates almost exactly what I'm trying to do, including every type of calculation, although I've reduced the number of them to keep it simpler.
This last part takes about 40 minutes per DataFrame in the dict, i.e. roughly 400 minutes total for this dataset.
I haven't worked with large datasets in the past. From what I understand, I should minimize for loops and apply calls, which I've tried to do, but what else should I be doing? I'd appreciate any input.
Thank you
So, I went to a dark place to figure out some ways to help here :)
The long and short of it is that the two calculations at the end of your script, where p > 500, are what's killing you. When p < 500, I can get some performance gains, which I'll detail below.
Instead of essentially iterating combination by combination and filling out your DataFrame, I took the approach of starting with a DataFrame that already has all the combos (2,250 rows in your example above) and then working to the right, vectorizing where I could. I think there is a lot more improvement to be had here, but I couldn't get it to work as well as I'd like, so maybe someone else can help.
Here's the code I ended up with. It repeats the setup from your question (with the keys and periods cut down for timing) and then builds each DataFrame column by column.
import pandas as pd
import numpy as np
import datetime as dt
from scipy import stats
import time

def calc_beta(a_series, b_series):
    covariance = np.cov(a_series, b_series)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta

dates = [dt.datetime.today().date() - dt.timedelta(days=x) for x in range(2000)]
m_list = [str(i) + 'm' for i in range(0, 15)]
names = [i + j for i in m_list for j in m_list]
#keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
keys = ['A']
values = [pd.DataFrame([np.random.rand(225) for x in range(0, 2000)], index=dates, columns=names) for i in range(0, 10)]
df_dict = dict(zip(keys, values))
combos = ['{}/{}'.format(*np.random.choice(names, 2)) for i in range(750)] + ['{}/{}/{}'.format(*np.random.choice(names, 3)) for i in range(1500)]
#periods = [20, 60, 100, 200, 500, 1000, 2000] #num of datapoints to use from time series
periods = [20] #num of datapoints to use from time series
benchmarks = np.random.choice(combos, 25) #benchmarks to compare combos to

data_dict = {}
for i in list(df_dict.keys()):
    df = df_dict[i]
    # one row per combo; columns are added working left to right
    mydf = pd.DataFrame(combos, columns=['name'])
    mydf[['a', 'b', 'c', 'd', 'e', 'f']] = mydf.name.str.replace('/', '').str.replace('m', ',').str[0:-1].str.split(',', expand=True)

    def get_series(a):
        if len(a) == 2:
            l1, l2 = a
            s = df[l1] - df[l2]
            return s.tolist()
        else:
            l1, l2, l3 = a
            s = df[l1] - df[l2] - df[l3]
            return s.tolist()

    mydf['series'] = mydf['name'].apply(lambda x: get_series(x.split('/')))
    mydf['last'] = mydf['series'].str[-1]

    for p in periods:
        _str = str(p)
        mydf['quantile' + _str] = mydf.apply(lambda x: stats.percentileofscore(x['series'][-p:], x['last']), axis=1)
        mydf['z_score' + _str] = mydf.apply(lambda x: stats.zscore(x['series'][-p:])[-1], axis=1)
        mydf['std' + _str] = mydf.apply(lambda x: np.std(x['series'][-p:]), axis=1)
        mydf['range' + _str] = mydf.apply(lambda x: max(x['series'][-p:]) - min(x['series'][-p:]), axis=1)
        mydf['last_range' + _str] = mydf['last'] / mydf['range' + _str]
        mydf['last_std' + _str] = mydf['last'] / mydf['std' + _str]
        if p > 100:
            mydf['5d_autocorr' + _str] = mydf.apply(lambda x: pd.Series(x['series'][-p:]).autocorr(-5), axis=1)
        else:
            mydf['5d_autocorr' + _str] = np.nan

        # redefined here for benchmarks, which can have 1, 2 or 3 legs
        def get_series(a):
            if len(a) == 1:
                b = df[a[0]]
                return b.tolist()
            elif len(a) == 2:
                b = df[a[0]] - df[a[1]]
                return b.tolist()
            else:
                b = df[a[0]] - df[a[1]] - df[a[2]]
                return b.tolist()

        for b in benchmarks:
            corr_ticker = '{}_corr{}'.format(b, _str)
            beta_ticker = '{}_beta{}'.format(b, _str)
            b_series = get_series(b.split('/'))[-p:]
            mydf[corr_ticker] = mydf.apply(lambda x: stats.pearsonr(np.array(x['series'][-p:]), np.array(b_series))[0], axis=1)
            mydf[beta_ticker] = mydf.apply(lambda x: calc_beta(np.array(x['series'][-p:]), np.array(b_series)), axis=1)
            if p > 500:
                mydf[b + '_20rolling_corr_mean' + _str] = mydf.apply(lambda x: pd.Series(x['series'][-p:]).rolling(20).corr(pd.Series(b_series)).mean(), axis=1)
                mydf[b + '_20d_rolling_beta_mean' + _str] = mydf.apply(lambda x: pd.DataFrame({x['name']: pd.Series(x['series'][-p:]), b: pd.Series(b_series)}).rolling(20) \
                    .cov(pd.DataFrame({x['name']: pd.Series(x['series'][-p:]), b: pd.Series(b_series)}), pairwise=True) \
                    .drop([x['name']], axis=1) \
                    .unstack(1) \
                    .droplevel(0, axis=1) \
                    .apply(lambda row: row[x['name']] / row[b], axis=1) \
                    .mean(), axis=1)

    data_dict[i] = mydf
I only ran one key, 'A', and varied the period. Keeping 'A' constant and changing the period, I get the performance gains shown here; at period = 400 I still get roughly 60% better performance.
A 20
Original: Total Time 25.74614143371582
Revised: Total Time 7.026344299316406
A 200
Original: Total Time 25.56810474395752
Revised: Total Time 10.015231847763062
A 400
Original: Total Time 28.221587419509888
Revised: Total Time 11.064109802246094
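For reference, the "Total Time" numbers above are simple wall-clock measurements; something along these lines reproduces them (the exact harness isn't shown, so this is just an assumed sketch):
import time

start = time.time()
# ... run either the original loop or the revised loop for one key and one period ...
print('Total Time', time.time() - start)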
Going to period 501, your original code took about 1121.6 seconds; mine was about the same. Going from 400 to 501 adds an enormous amount of time, all for two calculations (repeated over each benchmark).
If you need those metrics and have to calculate them at analysis time, that's where you should spend your effort. I found that working through pandas Series is slow; you'll notice I used scipy for the correlation in one place because the gains were worth it. If you can use numpy or scipy directly for those last two calculations, you'll see gains there as well.
The other place to look is where I'm using lambda functions; that is still row by row, just like a for loop. I'm saving the series on each row so I can reuse it in the calculations that follow:
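To give a concrete idea of what that could look like, here is a sketch of those two p > 500 metrics done with numpy sliding windows instead of pandas rolling objects. It assumes numpy >= 1.20 (for sliding_window_view) and plain 1-D arrays as inputs; the function names are just illustrative:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_corr_mean(a, b, window=20):
    # mean of the 20-period rolling correlation between two equal-length arrays
    aw = sliding_window_view(a, window)                  # shape (n - window + 1, window)
    bw = sliding_window_view(b, window)
    a_dev = aw - aw.mean(axis=1, keepdims=True)          # de-mean each window
    b_dev = bw - bw.mean(axis=1, keepdims=True)
    num = (a_dev * b_dev).sum(axis=1)
    den = np.sqrt((a_dev ** 2).sum(axis=1) * (b_dev ** 2).sum(axis=1))
    return np.mean(num / den)

def rolling_beta_mean(a, b, window=20):
    # mean of the 20-period rolling beta: cov(a, b) / var(b) per window
    aw = sliding_window_view(a, window)
    bw = sliding_window_view(b, window)
    a_dev = aw - aw.mean(axis=1, keepdims=True)
    b_dev = bw - bw.mean(axis=1, keepdims=True)
    cov = (a_dev * b_dev).mean(axis=1)
    var_b = (b_dev ** 2).mean(axis=1)
    return np.mean(cov / var_b)
Because correlation and beta are ratios, the sample/population normalization cancels, so these should match the pandas rolling versions (which simply skip the first 19 NaN windows when taking the mean) up to floating point, while staying in compiled numpy code instead of a Python-level pass per window. In the original loop they would be called as rolling_corr_mean(p_series.to_numpy(), b_series.to_numpy()), and similarly for the beta; in the revised code the inputs are already lists, so np.asarray works just as well.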
def get_series(a):
    if len(a) == 2:
        l1, l2 = a
        s = df[l1] - df[l2]
        return s.tolist()
    else:
        l1, l2, l3 = a
        s = df[l1] - df[l2] - df[l3]
        return s.tolist()

mydf['series'] = mydf['name'].apply(lambda x: get_series(x.split('/')))
That 'series' column is made of lists that get passed into the lambda functions. I was hoping to find a way to vectorize this by calculating all rows at the same time, but some functions require a Series and some take a list, and I just couldn't figure it out. Here's an example:
mydf['quantile' + _str] = mydf.apply(lambda x: stats.percentileofscore(x['series'][-p:], x['last']), axis=1)
If you can figure out how to vectorize this, and then apply the same idea to the calculations where p > 500, you'll see some real savings.
In the end, with your code or mine, the real issue is those last two calculations. Everything else is smaller, but real, savings that add up; rethinking that last piece is what will really save your day.
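As a rough sketch of the direction I mean (my assumption, not something I tested against your full dataset): because every row's 'series' list has the same length, you can stack them into one 2-D array inside the period loop and compute the simple statistics for all rows at once instead of via .apply:
# Stack the per-row lists into one array: rows = combos, columns = dates.
arr = np.array(mydf['series'].tolist())      # shape (n_combos, 2000)
window = arr[:, -p:]                         # last p observations for every combo
last = window[:, -1]

mydf['std' + _str] = window.std(axis=1)
mydf['range' + _str] = window.max(axis=1) - window.min(axis=1)
mydf['z_score' + _str] = (last - window.mean(axis=1)) / window.std(axis=1)
# Percentile of the last value within its own window, for every row at once.
# This matches scipy's percentileofscore with kind='weak'; the default 'rank'
# treatment of ties differs slightly.
mydf['quantile' + _str] = (window <= last[:, None]).mean(axis=1) * 100
The same stacking idea is what you would want to push into the p > 500 benchmark calculations, since that is where the row-by-row applies really hurt.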
The other option is to multiprocess, or to break the work up across multiple machines.
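If you go the multiprocessing route, the keys of df_dict are independent, so each one can be handed to its own worker process. A minimal sketch only: the build_attr_frame wrapper below is hypothetical, standing in for the whole per-key loop above, and on Windows you would need to pass the data to the workers explicitly rather than relying on fork:
from multiprocessing import Pool

def build_attr_frame(key):
    # Hypothetical wrapper: run the combos/periods/benchmarks logic for one key
    # of df_dict and return that key together with its finished attribute DataFrame.
    df = df_dict[key]
    mydf = pd.DataFrame(combos, columns=['name'])
    # ... rest of the per-key logic from the loop above ...
    return key, mydf

if __name__ == '__main__':
    # Farm each key out to a worker; e.g. 4 processes for a 4-core machine.
    with Pool(processes=4) as pool:
        results = pool.map(build_attr_frame, list(df_dict.keys()))
    data_dict = dict(results)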