I have been looking at using multiprocessing, cyton and numba to speed up the below part of my code, but I just cannot figure it out. My current conclusion is that my code is ineffective, although it took me a lot of efford to get the code together and working, it's just not at all effective and would appreciate some help on the matter: Below code takes about 78 seconds to run.
daten={'Contract':Contract_Name,'Product':Products,'Relative':Relative_Name,'Start_Date':Start_Dates,'End_Date':End_Dates,'Relevant':Relevant, 'Volume':Volumes,'Hours':Hours}
df=pd.DataFrame(daten)
Matrix=[]
pt = process_time()
df['Start_Date']=pd.to_datetime(df['Start_Date'])
df['End_Date']=pd.to_datetime(df['End_Date'])
h_prof['DeliveryStart']=pd.to_datetime(h_prof['DeliveryStart'])
for j in h_prof['DeliveryStart']:
Vector=[]
idx2=h_prof['DeliveryStart'].to_list().index(j)
for i in df['Start_Date']:
check=False
idx=df['Start_Date'].to_list().index(i)
if (i<=j)and(df['End_Date'][idx]>=j):
if df['Product'][idx]=='bunn':
check=True
elif h_prof['P_OP'][idx2]=='P':
check=True
if check==True:
Vector.append(1)
else:
Vector.append(0)
Matrix.append(Vector)
Thanks a lot in advance! H_prof df:
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 8760 non-null object
1 Id 8760 non-null int64
2 DeliveryStart 8760 non-null datetime64[ns]
3 DeliveryEnd 8760 non-null datetime64[ns]
4 Quantity 8760 non-null float64
5 QuantityUnit 8760 non-null object
6 Time 8760 non-null object
7 Hour 8760 non-null int64
8 WkDay 8760 non-null int64
9 P_OP 8760 non-null object
dtypes: datetime64[ns](2), float64(1), int64(3), object(4)
memory usage: 684.5+ KB
Fast solution using numpy
You can use numpy np.ufunc.outer and broadcasting to do your task in a really fast manner:
import numpy as np
m = (
np.less_equal.outer(df['Start_Date'].to_numpy(),
h_prof['DeliveryStart'].to_numpy())
& np.greater_equal.outer(df['End_Date'].to_numpy(),
h_prof['DeliveryStart'].to_numpy())
& (np.equal(df['Product'].to_numpy(), 'bunn')[:,np.newaxis]
| np.equal(h_prof['P_OP'].to_numpy(), 'P')[np.newaxis,:])
).astype(int).T.tolist()
The np.less_equal
is the equivalent to this condition (i<=j)
in your code, the np.greater_equal
to this condition and(df['End_Date'][idx]>=j)
, the first np.equal
to this condition df['Product'][idx]=='bunn'
and the last np.equal
is for this condition h_prof['P_OP'][idx2]=='P'
. Knowing that the last 2 conditions have to be broadcasted using np.newaxis
to fit in the final output dimension.
In terms of gain, with the shapes for datasets being in my tests
df.shape, h_prof.shape
#((120, 3), (140, 2))
you have a gain 1000 times almost with this version and yours.
Answer improving your code
There are improvements to be done keeping the same style of coding.
First instead of doing a loop on the series and then get the index in the only goal to get the value in another column later:
for j in h_prof['DeliveryStart']:
...
idx2=h_prof['DeliveryStart'].to_list().index(j)
...
elif h_prof['P_OP'][idx2]=='P'
one can do:
for j, pop in h_prof[['DeliveryStart', 'P_OP']].to_numpy():
which becomes interesting in terms of speed when your data is above hundreds or rows. The same can be done with the second loop:
for i in df['Start_Date']:
...
idx=df['Start_Date'].to_list().index(i)
if ...(df['End_Date'][idx]>=j):
if df['Product'][idx]=='bunn':
that becomes
for i, e, prod in df[['Start_Date', 'End_Date','Product']].to_numpy():
So with an overall code being:
Matrix1=[]
df['Start_Date']=pd.to_datetime(df['Start_Date'])
df['End_Date']=pd.to_datetime(df['End_Date'])
h_prof['DeliveryStart']=pd.to_datetime(h_prof['DeliveryStart'])
for j, pop in h_prof[['DeliveryStart', 'P_OP']].to_numpy():
Vector=[]
for i, e, prod in df[['Start_Date', 'End_Date','Product']].to_numpy():
if ((i<=j) and (e>=j)
and ((prod=='bunn') or (pop=='P'))):
Vector.append(1)
else:
Vector.append(0)
Matrix1.append(Vector)
you already get a speed up of x15 with dataset sizes of
df.shape, h_prof.shape
#((120, 3), (140, 2))
knowing that the gain should increase with the size.
Last point if you want to increase gain using this style, these two conditions (prod=='bunn') or (pop=='P')
are duplicated, they could be done outside of the loops.