Search code examples
pythonpandasdataframesplitlogic

Splitting values from one data frame to another data frame based on certain conditions in another data frame in pandas


I have two data frames df1 and df2, and I want to adjust values from df2 to df1 based on conditions in df1. The conditions are based on 4 different columns and different conditions for different IDs in df1 and I need to put values of one column from df2 in such a way that it splits the value from df2 and adjust it in df1, and the sum of each IDs value should match in both the data frames.

So I have the following format of data:

enter image description here

enter image description here

I want to bring values from df2 to df1 and split it according to Start Day End Day, Start Time and End Time in df1 itself and the sum should be equal in both df1 and df2 for each ID.

Expected Output enter image description here

here is the same data frame created in pandas. these two tables are input values and I want the expected result as above.

df1 = pd.DataFrame({'ID': ["Ch1","Ch1","Ch1","Ch1","Ch1","Ch1","Ch2","Ch2","Ch2"],
               'Start Day': [1,1,1,6,6,6,1,1,1], 
               'End Day': [5,5,5,7,7,7,7,7,7], 
               'Start Time': [600,1200,1700,600,1200,1700,700,1200,1700], 
               'End Time': [1200,1700,2500,1200,1700,2500,1200,1700,2400]})
print(df1)

df2 = pd.DataFrame({'ID': ["Ch1","Ch1","Ch1","Ch2","Ch2","Ch2","Ch2","Ch2","Ch2","Ch2","Ch2","Ch2","Ch2","Ch2"],
                    'Start Day': [1,1,1,1,1,1,1,1,1,1,1,1,6,1],
                    'End Day': [7,7,7,5,5,5,5,5,5,5,5,5,7,7],
                    'Start Time': [600,1200,1700,800,900,1000,1100,1200,1300,1900,2000,2200,700,700],
                    'End Time': [1200,1700,2500,900,1000,1100,1200,1300,1400,2000,2200,2300,2400,2400],
                    'Values':[1125,2250,1125,346.5,346.5,346.5,346.5,346.5,346.5,189,189,346.5,1795.5,346.5]})
print(df2)

Can somebody please help me with this.

Calculations:

from df2 to df1 say in df2 I have values 1125 for day 1 to 7 and time band 600 to 1200 at ch1 I would want to split that value into day 1 to 5 and day 6 to 7 in df1, So to put value 1125 in 5 to 7 days and in the same time band I will divide the value by 7 and multiply it by 5 and as the time band is same i.e. 600 to 1200 in both the dataframes I will not divide any further and keep the value (1125/7*5) or (1125/df2(End Day- Start Day+1)*df1(End Day- Start Day+1)) into df1 where Start Day and End day is 1 to 5 respectively and time bands are 600 to 1200 and that value would be 803.6 out of 1125,

Similarly, for day 6 to day 7 in df1 we will split the Value of df2 in the same manner, from df2 Value column we will do this calculation: (1125/7*2) or (1125/df2(End Day- Start Day+1)*df1(End Day- Start Day+1)) into df1 where Start Day and End day is 6 to 7 respectively and time bands are 600 to 1200 and that value would be 321.4 out of 1125.

If in case Timeband changes in df1 we will add calculations in the Values, say in df1 I want start day and end day as 1 to 5 and time bands 700 to 1100 then I will put the values from df2 to df1 in the following manner:

(1125/7*5)*6/4 or (1125/df2(End Day- Start Day+1)*df1(End Day- Start Day+1))*df2(End Time-Start Time)/df1(End Time - Start Time)

Also, if from df2 we have start day end day 1 to 7 and start time and end time 600 to 1200 as 1125 and in df1 we have start day and end date 1 to 5 and Start time 700 to 1100 only and no other day band or timeband in any row, then in that case keep the whole 1125 value into that row of df1 itself.

please help me with this code and logic I will be really great full. Thanks in advance.


Solution

  • Updated answer, after further clarification on objective:

    These results match the example you provided for what you are looking for, so I feel confident that we are on the same page now.
    This may be a bit slow if you apply it to a huge data set, because it calls on the DataFrame.apply() function twice, to iterate through each row of df1 and for each row of df1 it iterates through each row of df2.

    I tried to catch each scenario of overlap between the day/time blocks that would require the returned value to be determined differently. You need to check that I didn't miss any other scenarios / edge cases.

    Here is the solution:

    def getDF2ValueForTimeBlock(df1row, df2row):
        if df2row["ID"] == df1row["ID"]:
            
            #Case 1: df2 window entirely contained within df1 window
            if (
                (df2row["Start Day"] >= df1row["Start Day"]) 
                & (df2row["End Day"] <= df1row["End Day"])
                & (df2row["Start Time"] >= df1row["Start Time"]) 
                & (df2row["End Time"] <= df1row["End Time"])
            ):    
                return df2row["Values"]
            
            #Case 2: df1 window entirely contained within df2 window
            elif (
                (df2row["Start Day"] <= df1row["Start Day"]) 
                & (df2row["End Day"] >= df1row["End Day"])
                & (df2row["Start Time"] <= df1row["Start Time"]) 
                & (df2row["End Time"] >= df1row["End Time"])
            ):    
                #Return only proportion of df2 values after scaling down to span of df1
                dayspanratio = (df1row["End Day"] - df1row["Start Day"] + 1) / (df2row["End Day"] - df2row["Start Day"] + 1)
                hourspanratio = (df1row["End Time"] - df1row["Start Time"]) / (df2row["End Time"] - df2row["Start Time"])
                return df2row["Values"] * dayspanratio * hourspanratio
            
            
            #Case 3: partial overlap on Days, df2 time completely within df1 time boundaries
            elif(
                (
                    (df1row["Start Day"] <= df2row["Start Day"] <= df1row["End Day"])
                    | (df1row["Start Day"] <= df2row["End Day"] <= df1row["End Day"])
                )
                &(
                    (df1row["Start Time"] <= df2row["Start Time"] <= df2row["End Time"] <= df1row["End Time"])
                )
            ):
                #Find proportion of df2 values allocable to overlapping width of df1 window 
                maxStartDay = max([df1row["Start Day"], df2row["Start Day"]])
                minEndDay = min([df1row["End Day"], df2row["End Day"]])
                
                dayspanratio = (minEndDay - maxStartDay + 1) / (df2row["End Day"] - df2row["Start Day"] + 1)
                hourspanratio = 1
                return df2row["Values"] * dayspanratio * hourspanratio
            
            
            
            #Case 4: df2 window partially overlapping with df1 window on both Days and Time
            elif(
                (
                    (df1row["Start Day"] <= df2row["Start Day"] <= df1row["End Day"])
                    | (df1row["Start Day"] <= df2row["End Day"] <= df1row["End Day"])
                )
                &(
                    (df1row["Start Time"] <= df2row["Start Time"] <= df1row["End Time"])
                    | (df1row["Start Time"] <= df2row["End Time"] <= df1row["End Time"])
                    #for df2 time extending beyond df1 time span on both boundaries:
                    | (df2row["Start Time"] <= df1row["Start Time"] <= df1row["End Time"] <= df2row["End Time"]) 
                )
            ):
                #Find proportion of df2 values allocable to overlapping width of df1 window 
                maxStartDay = max([df1row["Start Day"], df2row["Start Day"]])
                minEndDay = min([df1row["End Day"], df2row["End Day"]])
                
                maxStartTime = max([df1row["Start Time"], df2row["Start Time"]])
                minEndTime = min([df1row["End Time"], df2row["End Time"]])
                
                dayspanratio = (minEndDay - maxStartDay + 1) / (df2row["End Day"] - df2row["Start Day"] + 1)
                hourspanratio = (minEndTime - maxStartTime) / (df2row["End Time"] - df2row["Start Time"])
                return df2row["Values"] * dayspanratio * hourspanratio
            
            
            #Case 5: Channel ID matches, but no overlap in both days and time windows
            else:
                return 0
            
        else:
            #Case Different Channel
            return 0
    
    
    df1["Values"] = df1.apply(
        lambda d1row: df2.apply(lambda d2row: getDF2ValueForTimeBlock(d1row, d2row), axis=1).sum(), axis=1
    )
    
    print(df1)
    

    OUTPUT:

        ID  Start Day  End Day  Start Time  End Time       Values
    0  Ch1          1        5         600      1200   803.571429
    1  Ch1          1        5        1200      1700  1607.142857
    2  Ch1          1        5        1700      2500   803.571429
    3  Ch1          6        7         600      1200   321.428571
    4  Ch1          6        7        1200      1700   642.857143
    5  Ch1          6        7        1700      2500   321.428571
    6  Ch2          1        7         700      1200  2016.000000
    7  Ch2          1        7        1200      1700  1323.000000
    8  Ch2          1        7        1700      2400  1606.500000