Tags: python, pandas, probability

Create a probability table based on the time of the event in Python


I have this dataset for a university project, which I arrived at after some manipulation of the data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
'duplicates': [
     [('007', "us1", "us2", "time1", 'time2', 4)],
     [('008', "us1", "us2", "time1", 'time2', 5)],
     [('009', "us1", "us2", "time1", 'time2', 6)],
     [('007', 'us2', "us3", "time1", 'time2', 4)],
     [('008', 'us2', "us3", "time1", 'time2', 7)], 
     [('009', 'us2', "us3", "time1", 'time2', 11)], 
     [('001', 'us5', 'us1', "time1", 'time2', 0)], 
     [('008', 'us5', 'us1', "time1", 'time2', 19)], 
     [('007',"us3", "us2", "time1", 'time2', 2)],
     [('007',"us3", "us2", "time1", 'time2', 34)],
     [('009',"us3", "us2", "time1", 'time2', 67)]],
'numberOfInteractions': [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11]
   })


Each entry in 'duplicates' is a list holding one tuple: (ID, USER1, USER2, TIME USER1, TIME USER2, DELAY BETWEEN TIMES)

Now I have to create a user x user probability table, which I built by counting the interactions. For column us2 the entries are (1 + 2 + 3)/19, NaN, and (11 + 1 + 1)/19. In this case 1 + 2 + 3 is the sum of numberOfInteractions for the pair df[us1, us2] (rows 0 to 2 of the data).


The code for that is here:

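# Prepend each row's numberOfInteractions to its tuple(s).
df['duplicates'] = df.apply(
    lambda x: [(x['numberOfInteractions'], a, b, c, d, e, f)
               for a, b, c, d, e, f in x.duplicates], axis=1)

# One row per tuple, then sum the interactions for each (USER1, USER2) pair.
df = (pd.DataFrame(df["duplicates"].explode().tolist(),
                   columns=["numberOfInteractions", "ID", "USER1", "USER2", "TAU1", "TAU2", "DELAY"])
      .groupby(["USER1", "USER2"])["numberOfInteractions"]
      .agg("sum").to_frame().unstack())

df.columns = df.columns.get_level_values(1)
# Index.union replaces the `df.index | df.columns` set operation, which newer pandas no longer supports.
combined = df.index.union(df.columns)
for col in combined:
    if col not in df.columns:
        df[col] = np.nan
    # Normalise each column so that its non-NaN entries sum to 1.
    df[col] = df[col] / df[col].sum(skipna=True)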

The problem here is that I want a probability based on the last part of the tuple (the DELAY BETWEEN TIMES).

For example, ('us5', 'us1') had two interactions, one with delay 0 and another with delay 19 (rows 6 and 7 of the data). I want the probability as a tuple over delay buckets, such as (up to 5, up to 19, up to 60, up to 80, up to 98). In this case df['us5', 'us1'] would be (7/15, 8/15, 0/15, 0/15, 0/15) instead of today's single value of 1, which my algorithm produces because it adds (8 + 7)/15.

This is the idea but I don't even know how to start.


Solution

  • I think there are two ways to go.

    The first (what I would do) is to add a new column based on DELAY and numberOfInteractions. This assumes df is the exploded DataFrame with named columns, before the groupby:

    def mapToNbOfInteractionsPerDelay(row):
        # `row` is a single row of df: apply(..., axis=1) passes rows, not groups.
        nbOfInteractions = row['numberOfInteractions']
        delay = row['DELAY']

        # Put the count in the slot of the first bucket whose upper bound covers the delay.
        if delay <= 5:
            return (nbOfInteractions, 0, 0, 0, 0)
        elif delay <= 19:
            return (0, nbOfInteractions, 0, 0, 0)
        elif delay <= 60:
            return (0, 0, nbOfInteractions, 0, 0)
        elif delay <= 80:
            return (0, 0, 0, nbOfInteractions, 0)
        else:
            return (0, 0, 0, 0, nbOfInteractions)


    df["nbOfInteractionsPerDelay"] = df[["DELAY", "numberOfInteractions"]].apply(mapToNbOfInteractionsPerDelay, axis=1)
    

    Then sum the tuples element-wise for each (USER1, USER2) pair:

    df = (df.groupby(["USER1","USER2"])["nbOfInteractionsPerDelay"]
            .agg(lambda l : tuple([sum(x) for x in zip(*l)])).to_frame().unstack())
    

    which then gives you this:

          nbOfInteractionsPerDelay                                    
    USER2                      us1               us2               us3
    USER1                                                            
    us1                        NaN   (3, 3, 0, 0, 0)               NaN
    us2                        NaN               NaN  (4, 11, 0, 0, 0)
    us3                        NaN  (1, 0, 1, 11, 0)               NaN
    us5            (7, 8, 0, 0, 0)               NaN               NaN
    

    From there, you can get the probabilities you expect by dividing the entries of each tuple by the total number of interactions in its USER2 column.
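
A minimal sketch of that last normalisation step, under some assumptions: it keeps one numeric column per delay bucket instead of a tuple column (easier to divide), and it takes the denominator to be the total for each USER2 column, which is what the 7/15, 8/15 example in the question suggests. The names `flat`, `EDGES`, `counts`, `probs` and `table` are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Exploded rows from the question: (numberOfInteractions, USER1, USER2, DELAY).
rows = [
    (1, "us1", "us2", 4), (2, "us1", "us2", 5), (3, "us1", "us2", 6),
    (4, "us2", "us3", 4), (5, "us2", "us3", 7), (6, "us2", "us3", 11),
    (7, "us5", "us1", 0), (8, "us5", "us1", 19),
    (1, "us3", "us2", 2), (1, "us3", "us2", 34), (11, "us3", "us2", 67),
]
flat = pd.DataFrame(rows, columns=["numberOfInteractions", "USER1", "USER2", "DELAY"])

# Upper bounds of the first four buckets; delays above 80 fall in the fifth.
EDGES = [5, 19, 60, 80]

# side="left" returns the first position whose edge is >= delay,
# i.e. the same bucket as the `delay <= edge` chain in the answer.
flat["bucket"] = np.searchsorted(EDGES, flat["DELAY"], side="left")

# Interaction counts per (USER1, USER2) pair, one column per bucket 0..4.
counts = (
    flat.groupby(["USER1", "USER2", "bucket"])["numberOfInteractions"].sum()
        .unstack("bucket", fill_value=0)
        .reindex(columns=range(len(EDGES) + 1), fill_value=0)
)

# Assumed denominator: all interactions that share the same USER2 column.
totals = counts.sum(axis=1).groupby(level="USER2").transform("sum")
probs = counts.div(totals, axis=0)

# Optional: pack each row back into a tuple and pivot to a user x user table.
table = pd.Series(
    [tuple(round(float(v), 3) for v in r) for r in probs.to_numpy()],
    index=probs.index,
).unstack("USER2")
print(table)
```

With the question's data, the row for (us5, us1) in `probs` works out to 7/15, 8/15, 0, 0, 0, and the probabilities within each USER2 column sum to 1.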

    The second way is to split your DataFrame into 5 DataFrames, one per delay bucket, and then merge them.
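
A rough sketch of this split-and-merge variant, assuming the same bucket boundaries as the first approach (5, 19, 60, 80). The names `flat`, `BUCKETS`, `tables` and `merged` are hypothetical, and empty buckets are skipped during the split and added back as zero columns afterwards.

```python
import pandas as pd

# Exploded rows from the question: (numberOfInteractions, USER1, USER2, DELAY).
rows = [
    (1, "us1", "us2", 4), (2, "us1", "us2", 5), (3, "us1", "us2", 6),
    (4, "us2", "us3", 4), (5, "us2", "us3", 7), (6, "us2", "us3", 11),
    (7, "us5", "us1", 0), (8, "us5", "us1", 19),
    (1, "us3", "us2", 2), (1, "us3", "us2", 34), (11, "us3", "us2", 67),
]
flat = pd.DataFrame(rows, columns=["numberOfInteractions", "USER1", "USER2", "DELAY"])

# (label, exclusive lower bound, inclusive upper bound) for each delay bucket.
BUCKETS = [
    ("<=5", float("-inf"), 5),
    ("<=19", 5, 19),
    ("<=60", 19, 60),
    ("<=80", 60, 80),
    (">80", 80, float("inf")),
]

# Split: one summed Series per bucket, indexed by (USER1, USER2).
tables = {}
for label, low, high in BUCKETS:
    subset = flat[(flat["DELAY"] > low) & (flat["DELAY"] <= high)]
    if subset.empty:
        continue  # nothing in this bucket; its column is restored below
    tables[label] = subset.groupby(["USER1", "USER2"])["numberOfInteractions"].sum()

# Merge: align the per-bucket Series side by side on the (USER1, USER2) index.
merged = (
    pd.concat(tables, axis=1)
      .reindex(columns=[label for label, _, _ in BUCKETS])
      .fillna(0)
      .astype(int)
      .sort_index()
)
print(merged)
```

Each row of `merged` holds the same five counts as the tuples produced by the first approach, so the same normalisation can be applied to it.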