
How can I effectively speed up my loops through Pandas for the below code?


I have been looking at using multiprocessing, Cython and Numba to speed up the part of my code below, but I just cannot figure it out. My current conclusion is that my code is inefficient: although it took me a lot of effort to get it together and working, it is just not efficient at all, and I would appreciate some help on the matter. The code below takes about 78 seconds to run.

import pandas as pd
from time import process_time

daten = {'Contract': Contract_Name, 'Product': Products, 'Relative': Relative_Name,
         'Start_Date': Start_Dates, 'End_Date': End_Dates, 'Relevant': Relevant,
         'Volume': Volumes, 'Hours': Hours}
df = pd.DataFrame(daten)

Matrix=[]
pt = process_time()
df['Start_Date']=pd.to_datetime(df['Start_Date'])
df['End_Date']=pd.to_datetime(df['End_Date'])
h_prof['DeliveryStart']=pd.to_datetime(h_prof['DeliveryStart'])

for j in h_prof['DeliveryStart']:
        Vector=[]
        idx2=h_prof['DeliveryStart'].to_list().index(j)
        for i in df['Start_Date']:
            check=False
            idx=df['Start_Date'].to_list().index(i)
            if (i<=j)and(df['End_Date'][idx]>=j):
                if df['Product'][idx]=='bunn':
                    check=True
                elif h_prof['P_OP'][idx2]=='P':
                    check=True
            if check==True:
                Vector.append(1)
            else:
                Vector.append(0)
        Matrix.append(Vector)

Thanks a lot in advance!

The h_prof DataFrame (h_prof.info() output):

RangeIndex: 8760 entries, 0 to 8759
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Name            8760 non-null   object        
 1   Id              8760 non-null   int64         
 2   DeliveryStart   8760 non-null   datetime64[ns]
 3   DeliveryEnd     8760 non-null   datetime64[ns]
 4   Quantity        8760 non-null   float64       
 5   QuantityUnit    8760 non-null   object        
 6   Time            8760 non-null   object        
 7   Hour            8760 non-null   int64         
 8   WkDay           8760 non-null   int64         
 9   P_OP            8760 non-null   object        
dtypes: datetime64[ns](2), float64(1), int64(3), object(4)
memory usage: 684.5+ KB

Solution

  • Fast solution using numpy

    You can use NumPy's np.ufunc.outer and broadcasting to do your task really fast:

    import numpy as np
    
    m = (
        np.less_equal.outer(df['Start_Date'].to_numpy(), 
                            h_prof['DeliveryStart'].to_numpy())
        & np.greater_equal.outer(df['End_Date'].to_numpy(), 
                                 h_prof['DeliveryStart'].to_numpy())
        & (np.equal(df['Product'].to_numpy(), 'bunn')[:,np.newaxis]
           | np.equal(h_prof['P_OP'].to_numpy(), 'P')[np.newaxis,:])
    ).astype(int).T.tolist()
    

    The np.less_equal is the equivalent of the condition (i<=j) in your code, np.greater_equal of the condition (df['End_Date'][idx]>=j), the first np.equal of the condition df['Product'][idx]=='bunn', and the last np.equal of the condition h_prof['P_OP'][idx2]=='P'. Note that the last two conditions have to be broadcast with np.newaxis to fit the dimensions of the final output.
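As a minimal, runnable illustration (with made-up tiny frames, not the original data), the outer comparisons build the whole (len(df), len(h_prof)) grid at once:

```python
import numpy as np
import pandas as pd

# Hypothetical tiny stand-ins for df and h_prof
df_demo = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2021-01-01', '2021-01-05']),
    'End_Date':   pd.to_datetime(['2021-01-10', '2021-01-06']),
    'Product':    ['bunn', 'other'],
})
h_demo = pd.DataFrame({
    'DeliveryStart': pd.to_datetime(['2021-01-02', '2021-01-07']),
    'P_OP':          ['P', 'OP'],
})

m = (
    # (i <= j) for every (i, j) pair at once
    np.less_equal.outer(df_demo['Start_Date'].to_numpy(),
                        h_demo['DeliveryStart'].to_numpy())
    # (End_Date >= j) for every pair
    & np.greater_equal.outer(df_demo['End_Date'].to_numpy(),
                             h_demo['DeliveryStart'].to_numpy())
    # (Product == 'bunn') OR (P_OP == 'P'), broadcast to the grid
    & (np.equal(df_demo['Product'].to_numpy(), 'bunn')[:, np.newaxis]
       | np.equal(h_demo['P_OP'].to_numpy(), 'P')[np.newaxis, :])
).astype(int).T.tolist()

print(m)  # one inner list per h_demo row, one entry per df_demo row
```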

    In terms of gain, with the dataset shapes in my tests being

    df.shape, h_prof.shape
    #((120, 3), (140, 2))
    

    this version is almost 1000 times faster than yours.

  • Improving your code

    There are improvements to be made while keeping the same coding style.

    First, instead of looping over the series and then looking up the index with the sole purpose of fetching a value from another column later:

    for j in h_prof['DeliveryStart']:
            ...
            idx2=h_prof['DeliveryStart'].to_list().index(j)
            ...
               elif h_prof['P_OP'][idx2]=='P':
    

    one can do:

    for j, pop in h_prof[['DeliveryStart', 'P_OP']].to_numpy():
    

    which becomes interesting in terms of speed when your data goes beyond hundreds of rows. The same can be done with the second loop:

    for i in df['Start_Date']:
        ...
        idx=df['Start_Date'].to_list().index(i)
        if ...(df['End_Date'][idx]>=j):
            if df['Product'][idx]=='bunn':
    

    that becomes

    for i, e, prod in df[['Start_Date', 'End_Date','Product']].to_numpy():
    
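To see why this unpacking works: iterating over the 2-D array returned by .to_numpy() yields one row per step, which tuple-unpacks directly into the loop variables. A hypothetical mini-example:

```python
import pandas as pd

# Hypothetical frame just to demonstrate row unpacking
demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [10.0, 20.0]})

rows = []
for a, b, c in demo[['a', 'b', 'c']].to_numpy():
    # each iteration delivers one full row, no index lookups needed
    rows.append((a, b, c))

print(rows)
```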

    So the overall code becomes:

    Matrix1=[]
    df['Start_Date']=pd.to_datetime(df['Start_Date'])
    df['End_Date']=pd.to_datetime(df['End_Date'])
    h_prof['DeliveryStart']=pd.to_datetime(h_prof['DeliveryStart'])
    
    for j, pop in h_prof[['DeliveryStart', 'P_OP']].to_numpy():
            Vector=[]
            for i, e, prod in df[['Start_Date', 'End_Date','Product']].to_numpy():
                if ((i<=j) and (e>=j)
                   and ((prod=='bunn') or (pop=='P'))): 
                    Vector.append(1)
                else: 
                    Vector.append(0)
            Matrix1.append(Vector)
    

    you already get a speed-up of about 15x with dataset sizes of

    df.shape, h_prof.shape
    #((120, 3), (140, 2))
    

    knowing that the gain should increase with the size.

    One last point: if you want to increase the gain further while keeping this style, note that the two conditions (prod=='bunn') and (pop=='P') are re-evaluated on every iteration even though each depends on a single loop variable; they could be computed once outside the loops.
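Following that last remark, here is a sketch (again on hypothetical small frames) of the loop version with those two flags precomputed once as boolean arrays:

```python
import pandas as pd

# Hypothetical small stand-ins for df and h_prof
df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2021-01-01', '2021-01-05']),
    'End_Date':   pd.to_datetime(['2021-01-10', '2021-01-06']),
    'Product':    ['bunn', 'other'],
})
h_prof = pd.DataFrame({
    'DeliveryStart': pd.to_datetime(['2021-01-02', '2021-01-07']),
    'P_OP':          ['P', 'OP'],
})

# Evaluate the string comparisons once, outside the loops
is_bunn = (df['Product'] == 'bunn').to_numpy()
is_p = (h_prof['P_OP'] == 'P').to_numpy()

Matrix2 = []
for j, pop_ok in zip(h_prof['DeliveryStart'].to_numpy(), is_p):
    Vector = []
    for i, e, prod_ok in zip(df['Start_Date'].to_numpy(),
                             df['End_Date'].to_numpy(), is_bunn):
        # only the date comparisons remain inside the inner loop
        Vector.append(1 if (i <= j and e >= j and (prod_ok or pop_ok)) else 0)
    Matrix2.append(Vector)

print(Matrix2)
```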