
How can I effectively speed up my loops through Pandas for the below code?


I have been looking at using multiprocessing, Cython and Numba to speed up the part of my code below, but I just cannot figure it out. My current conclusion is that my code is inefficient: although it took me a lot of effort to get it together and working, it is just not efficient at all, and I would appreciate some help on the matter. The code below takes about 78 seconds to run.

import pandas as pd
from time import process_time

daten = {'Contract': Contract_Name, 'Product': Products, 'Relative': Relative_Name,
         'Start_Date': Start_Dates, 'End_Date': End_Dates, 'Relevant': Relevant,
         'Volume': Volumes, 'Hours': Hours}
df = pd.DataFrame(daten)

Matrix=[]
pt = process_time()
df['Start_Date']=pd.to_datetime(df['Start_Date'])
df['End_Date']=pd.to_datetime(df['End_Date'])
h_prof['DeliveryStart']=pd.to_datetime(h_prof['DeliveryStart'])

for j in h_prof['DeliveryStart']:
        Vector=[]
        idx2=h_prof['DeliveryStart'].to_list().index(j)
        for i in df['Start_Date']:
            check=False
            idx=df['Start_Date'].to_list().index(i)
            if (i<=j)and(df['End_Date'][idx]>=j):
                if df['Product'][idx]=='bunn':
                    check=True
                elif h_prof['P_OP'][idx2]=='P':
                    check=True
            if check==True:
                Vector.append(1)
            else:
                Vector.append(0)
        Matrix.append(Vector)

Thanks a lot in advance!

The h_prof DataFrame (h_prof.info() output):

RangeIndex: 8760 entries, 0 to 8759
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Name            8760 non-null   object        
 1   Id              8760 non-null   int64         
 2   DeliveryStart   8760 non-null   datetime64[ns]
 3   DeliveryEnd     8760 non-null   datetime64[ns]
 4   Quantity        8760 non-null   float64       
 5   QuantityUnit    8760 non-null   object        
 6   Time            8760 non-null   object        
 7   Hour            8760 non-null   int64         
 8   WkDay           8760 non-null   int64         
 9   P_OP            8760 non-null   object        
dtypes: datetime64[ns](2), float64(1), int64(3), object(4)
memory usage: 684.5+ KB

Solution

  • Fast solution using numpy

    You can use NumPy's np.ufunc.outer and broadcasting to do your task really fast:

    import numpy as np
    
    m = (
        np.less_equal.outer(df['Start_Date'].to_numpy(), 
                            h_prof['DeliveryStart'].to_numpy())
        & np.greater_equal.outer(df['End_Date'].to_numpy(), 
                                 h_prof['DeliveryStart'].to_numpy())
        & (np.equal(df['Product'].to_numpy(), 'bunn')[:,np.newaxis]
           | np.equal(h_prof['P_OP'].to_numpy(), 'P')[np.newaxis,:])
    ).astype(int).T.tolist()
    

    The np.less_equal is the equivalent of the condition (i<=j) in your code, np.greater_equal of the condition (df['End_Date'][idx]>=j), the first np.equal of the condition df['Product'][idx]=='bunn', and the last np.equal of the condition h_prof['P_OP'][idx2]=='P'. Note that the last two conditions have to be broadcast with np.newaxis to fit the dimensions of the final output.
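As a minimal, runnable illustration (with made-up tiny frames, not the original data), the outer comparisons build the whole (len(df), len(h_prof)) grid at once:

```python
import numpy as np
import pandas as pd

# Hypothetical tiny stand-ins for df and h_prof
df_demo = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2021-01-01', '2021-01-05']),
    'End_Date':   pd.to_datetime(['2021-01-10', '2021-01-06']),
    'Product':    ['bunn', 'other'],
})
h_demo = pd.DataFrame({
    'DeliveryStart': pd.to_datetime(['2021-01-02', '2021-01-07']),
    'P_OP':          ['P', 'OP'],
})

m = (
    # (i <= j) for every (i, j) pair at once
    np.less_equal.outer(df_demo['Start_Date'].to_numpy(),
                        h_demo['DeliveryStart'].to_numpy())
    # (End_Date >= j) for every pair
    & np.greater_equal.outer(df_demo['End_Date'].to_numpy(),
                             h_demo['DeliveryStart'].to_numpy())
    # (Product == 'bunn') OR (P_OP == 'P'), broadcast to the grid
    & (np.equal(df_demo['Product'].to_numpy(), 'bunn')[:, np.newaxis]
       | np.equal(h_demo['P_OP'].to_numpy(), 'P')[np.newaxis, :])
).astype(int).T.tolist()

print(m)  # one inner list per h_demo row, one entry per df_demo row
```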

    In terms of gain, with the dataset shapes in my tests being

    df.shape, h_prof.shape
    #((120, 3), (140, 2))
    

    this version is almost 1000 times faster than yours.

  • Improving your code

    There are improvements to be made while keeping the same coding style.

    First, instead of looping over the series and then looking up the index with the sole purpose of fetching a value from another column later:

    for j in h_prof['DeliveryStart']:
            ...
            idx2=h_prof['DeliveryStart'].to_list().index(j)
            ...
               elif h_prof['P_OP'][idx2]=='P':
    

    one can do:

    for j, pop in h_prof[['DeliveryStart', 'P_OP']].to_numpy():
    

    which becomes interesting in terms of speed when your data goes beyond hundreds of rows. The same can be done with the second loop:

    for i in df['Start_Date']:
        ...
        idx=df['Start_Date'].to_list().index(i)
        if ...(df['End_Date'][idx]>=j):
            if df['Product'][idx]=='bunn':
    

    that becomes

    for i, e, prod in df[['Start_Date', 'End_Date','Product']].to_numpy():
    
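To see why this unpacking works: iterating over the 2-D array returned by .to_numpy() yields one row per step, which tuple-unpacks directly into the loop variables. A hypothetical mini-example:

```python
import pandas as pd

# Hypothetical frame just to demonstrate row unpacking
demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [10.0, 20.0]})

rows = []
for a, b, c in demo[['a', 'b', 'c']].to_numpy():
    # each iteration delivers one full row, no index lookups needed
    rows.append((a, b, c))

print(rows)
```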

    So the overall code becomes:

    Matrix1=[]
    df['Start_Date']=pd.to_datetime(df['Start_Date'])
    df['End_Date']=pd.to_datetime(df['End_Date'])
    h_prof['DeliveryStart']=pd.to_datetime(h_prof['DeliveryStart'])
    
    for j, pop in h_prof[['DeliveryStart', 'P_OP']].to_numpy():
            Vector=[]
            for i, e, prod in df[['Start_Date', 'End_Date','Product']].to_numpy():
                if ((i<=j) and (e>=j)
                   and ((prod=='bunn') or (pop=='P'))): 
                    Vector.append(1)
                else: 
                    Vector.append(0)
            Matrix1.append(Vector)
    

    you already get a speed-up of about 15x with dataset sizes of

    df.shape, h_prof.shape
    #((120, 3), (140, 2))
    

    knowing that the gain should increase with the size.

    One last point: if you want to increase the gain further while keeping this style, note that the two conditions (prod=='bunn') and (pop=='P') are re-evaluated on every iteration even though each depends on a single loop variable; they could be computed once outside the loops.
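Following that last remark, here is a sketch (again on hypothetical small frames) of the loop version with those two flags precomputed once as boolean arrays:

```python
import pandas as pd

# Hypothetical small stand-ins for df and h_prof
df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2021-01-01', '2021-01-05']),
    'End_Date':   pd.to_datetime(['2021-01-10', '2021-01-06']),
    'Product':    ['bunn', 'other'],
})
h_prof = pd.DataFrame({
    'DeliveryStart': pd.to_datetime(['2021-01-02', '2021-01-07']),
    'P_OP':          ['P', 'OP'],
})

# Evaluate the string comparisons once, outside the loops
is_bunn = (df['Product'] == 'bunn').to_numpy()
is_p = (h_prof['P_OP'] == 'P').to_numpy()

Matrix2 = []
for j, pop_ok in zip(h_prof['DeliveryStart'].to_numpy(), is_p):
    Vector = []
    for i, e, prod_ok in zip(df['Start_Date'].to_numpy(),
                             df['End_Date'].to_numpy(), is_bunn):
        # only the date comparisons remain inside the inner loop
        Vector.append(1 if (i <= j and e >= j and (prod_ok or pop_ok)) else 0)
    Matrix2.append(Vector)

print(Matrix2)
```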