Create a probability table based on the time of the event in Python

I have this dataset for a university project, that I arrived after some manipulation on the data:

df = d = pd.DataFrame({
'duplicates': [
     [('007', "us1", "us2", "time1", 'time2', 4)],
     [('008', "us1", "us2", "time1", 'time2', 5)],
     [('009', "us1", "us2", "time1", 'time2', 6)],
     [('007', 'us2', "us3", "time1", 'time2', 4)],
     [('008', 'us2', "us3", "time1", 'time2', 7)], 
     [('009', 'us2', "us3", "time1", 'time2', 11)], 
     [('001', 'us5', 'us1', "time1", 'time2', 0)], 
     [('008', 'us5', 'us1', "time1", 'time2', 19)], 
     [('007',"us3", "us2", "time1", 'time2', 2)],
     [('007',"us3", "us2", "time1", 'time2', 34)],
     [('009',"us3", "us2", "time1", 'time2', 67)]],
'numberOfInteractions': [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11]
   })

the 'duplicates' is a tuple: (ID, USER1, USER2, TIME USER1, TIME USER2, DELAY BETWEEN TIMES)

Now I have to create a probability table user x user that I did by counting the interactions, so for column us2 we have (1 + 2 + 3)/19, Na/19, (11+1+1)/19. On this case 1 + 2 + 3 are the numberOfInteractions between (df[us1,us2]) on the data (lines 0 to 2 on the first picture).

The code for that is here:

    df['duplicates'] = df.apply(
            lambda x: [(x['numberOfInteractions'],a, b, c, d, e,f) for a, b, c, d, e, f in x.duplicates], 1)


df =(pd.DataFrame(df["duplicates"].explode().tolist(),
                  columns=["numberOfInteractions", "ID","USER1","USER2","TAU1","TAU2","DELAY"])
     .groupby(["USER1","USER2"])["numberOfInteractions"]
     .agg(sum).to_frame().unstack())


df.columns = df.columns.get_level_values(1)
combined = df.index|df.columns
for col in combined:
    if col not in df.columns:
        df[col] = np.nan
    df[col] = df[col] / df[col].sum(skipna=True)

The problem here is that I want a probability based on the last part of the tuple (the DELAY BETWEEN TIMES).

So, for example, 'us5', 'us1' had two interactions, one with delay 19 and another with delay 0 (lines 6 and 7 from the first picture), therefore I want to have this probability on a tuple like (less than 5, less than 19, less than 60, less than 80, less than 98), so on this case, df['us5', 'us1'] it will be: (7/15, 8/15, 0/15, 0/15, 0/15) instead of 1 as today (because with my algorithm is adding (8+7)/15, so it's 1).

This is the idea but I don't even know how to start.

Solution

I think you have two ways to go.

Either you go with a new column based on delay and numberOfInteractions (what I would do) :

def mapToNbOfInteractionsPerDelay(group):
    nbOfInteractions = group['numberOfInteractions']
    delay = group['DELAY']

    if(delay <= 5):
        return (nbOfInteractions, 0, 0, 0, 0)
    elif(delay <= 19):
        return (0, nbOfInteractions, 0, 0, 0)
    elif(delay <= 60):
        return (0, 0, nbOfInteractions, 0, 0)
    elif(delay <= 80):
        return (0, 0, 0, nbOfInteractions, 0)
    else:
        return (0, 0, 0, 0, nbOfInteractions)


df["nbOfInteractionsPerDelay"] = df[["DELAY", "numberOfInteractions"]].apply(mapToNbOfInteractionsPerDelay, axis=1)

Then you can go for :

df = (df.groupby(["USER1","USER2"])["nbOfInteractionsPerDelay"]
        .agg(lambda l : tuple([sum(x) for x in zip(*l)])).to_frame().unstack())

which will then give you this :

      nbOfInteractionsPerDelay                                    
USER2                      us1               us2               us3
USER1                                                            
us1                        NaN   (3, 3, 0, 0, 0)               NaN
us2                        NaN               NaN  (4, 11, 0, 0, 0)
us3                        NaN  (1, 0, 1, 11, 0)               NaN
us5            (7, 8, 0, 0, 0)               NaN               NaN

From there, you can easily have what you expect.

Or either you split your dataframe into 5 other dataframes, with each dataframe having values for a specific delay subset, and then you merge.