I have this dataset for a university project, which I arrived at after some manipulation of the data:
import pandas as pd

df = pd.DataFrame({
    'duplicates': [
        [('007', 'us1', 'us2', 'time1', 'time2', 4)],
        [('008', 'us1', 'us2', 'time1', 'time2', 5)],
        [('009', 'us1', 'us2', 'time1', 'time2', 6)],
        [('007', 'us2', 'us3', 'time1', 'time2', 4)],
        [('008', 'us2', 'us3', 'time1', 'time2', 7)],
        [('009', 'us2', 'us3', 'time1', 'time2', 11)],
        [('001', 'us5', 'us1', 'time1', 'time2', 0)],
        [('008', 'us5', 'us1', 'time1', 'time2', 19)],
        [('007', 'us3', 'us2', 'time1', 'time2', 2)],
        [('007', 'us3', 'us2', 'time1', 'time2', 34)],
        [('009', 'us3', 'us2', 'time1', 'time2', 67)]],
    'numberOfInteractions': [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11]
})
Each entry in 'duplicates' is a list containing a tuple: (ID, USER1, USER2, TIME USER1, TIME USER2, DELAY BETWEEN TIMES).
Now I have to create a user x user probability table, which I built by counting the interactions. For column us2 this gives (1 + 2 + 3)/19, NaN/19 and (11 + 1 + 1)/19, where 1 + 2 + 3 are the numberOfInteractions values for the (us1, us2) pairs (rows 0 to 2 of the dataframe).
The code for that is here:
import numpy as np

df['duplicates'] = df.apply(
    lambda x: [(x['numberOfInteractions'], a, b, c, d, e, f)
               for a, b, c, d, e, f in x.duplicates], axis=1)
df = (pd.DataFrame(df['duplicates'].explode().tolist(),
                   columns=['numberOfInteractions', 'ID', 'USER1', 'USER2',
                            'TAU1', 'TAU2', 'DELAY'])
        .groupby(['USER1', 'USER2'])['numberOfInteractions']
        .agg('sum').to_frame().unstack())
df.columns = df.columns.get_level_values(1)
combined = df.index.union(df.columns)
for col in combined:
    if col not in df.columns:
        df[col] = np.nan
    df[col] = df[col] / df[col].sum(skipna=True)
The problem is that I want a probability based on the last element of the tuple (the DELAY BETWEEN TIMES). For example, ('us5', 'us1') had two interactions, one with delay 19 and another with delay 0 (rows 6 and 7 of the dataframe). I want the probability as a tuple over delay buckets (less than 5, less than 19, less than 60, less than 80, less than 98), so in this case df['us5', 'us1'] would be (7/15, 8/15, 0/15, 0/15, 0/15) instead of 1 as it is today (my current algorithm adds (8 + 7)/15, which is 1).
This is the idea but I don't even know how to start.
I think there are two ways to go.
Either you add a new column based on DELAY and numberOfInteractions (what I would do):
def mapToNbOfInteractionsPerDelay(group):
    nbOfInteractions = group['numberOfInteractions']
    delay = group['DELAY']
    if delay <= 5:
        return (nbOfInteractions, 0, 0, 0, 0)
    elif delay <= 19:
        return (0, nbOfInteractions, 0, 0, 0)
    elif delay <= 60:
        return (0, 0, nbOfInteractions, 0, 0)
    elif delay <= 80:
        return (0, 0, 0, nbOfInteractions, 0)
    else:
        return (0, 0, 0, 0, nbOfInteractions)

df['nbOfInteractionsPerDelay'] = df[['DELAY', 'numberOfInteractions']].apply(
    mapToNbOfInteractionsPerDelay, axis=1)
Then you can go for :
df = (df.groupby(['USER1', 'USER2'])['nbOfInteractionsPerDelay']
        .agg(lambda l: tuple(sum(x) for x in zip(*l)))
        .to_frame().unstack())
which will then give you this:
nbOfInteractionsPerDelay
USER2 us1 us2 us3
USER1
us1 NaN (3, 3, 0, 0, 0) NaN
us2 NaN NaN (4, 11, 0, 0, 0)
us3 NaN (1, 0, 1, 11, 0) NaN
us5 (7, 8, 0, 0, 0) NaN NaN
From there, you can easily have what you expect.
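For instance, here is a minimal self-contained sketch of that last step (the `flat` dataframe, the `to_bins` helper and the `normalize` function are my own illustrative names, not part of your code): it bins each row's interaction count by delay, sums the bucket tuples per user pair, and then divides every tuple by its column's total number of interactions to get the probability tuples you described.

```python
import pandas as pd

# Flattened version of the question's data (one row per tuple)
flat = pd.DataFrame({
    "USER1": ["us1", "us1", "us1", "us2", "us2", "us2",
              "us5", "us5", "us3", "us3", "us3"],
    "USER2": ["us2", "us2", "us2", "us3", "us3", "us3",
              "us1", "us1", "us2", "us2", "us2"],
    "DELAY": [4, 5, 6, 4, 7, 11, 0, 19, 2, 34, 67],
    "numberOfInteractions": [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11],
})

# Put each row's interaction count into its delay bucket (same thresholds)
def to_bins(row):
    bins = [0] * 5
    thresholds = [5, 19, 60, 80]
    idx = next((i for i, t in enumerate(thresholds) if row["DELAY"] <= t), 4)
    bins[idx] = row["numberOfInteractions"]
    return tuple(bins)

flat["bins"] = flat.apply(to_bins, axis=1)

# Sum the bucket tuples per (USER1, USER2) pair, then pivot USER2 to columns
table = (flat.groupby(["USER1", "USER2"])["bins"]
             .apply(lambda s: tuple(sum(x) for x in zip(*s)))
             .unstack())

# Divide every tuple by its column's total number of interactions
def normalize(col):
    total = sum(sum(t) for t in col.dropna())
    return col.map(lambda t: tuple(v / total for v in t), na_action="ignore")

probs = table.apply(normalize)
```

`probs.loc['us5', 'us1']` then comes out as (7/15, 8/15, 0, 0, 0), matching the expected result from the question.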
Alternatively, you can split your dataframe into five dataframes, one per delay bucket, aggregate each one, and then merge them back together.
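A rough sketch of that second route, assuming the same flattened dataframe as in the question (the `flat`, `edges` and `parts` names are mine): one pivot table of interaction counts per delay bucket, with `pd.concat` as the merge step.

```python
import pandas as pd

# Flattened version of the question's data (one row per tuple)
flat = pd.DataFrame({
    "USER1": ["us1", "us1", "us1", "us2", "us2", "us2",
              "us5", "us5", "us3", "us3", "us3"],
    "USER2": ["us2", "us2", "us2", "us3", "us3", "us3",
              "us1", "us1", "us2", "us2", "us2"],
    "DELAY": [4, 5, 6, 4, 7, 11, 0, 19, 2, 34, 67],
    "numberOfInteractions": [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11],
})

# One interaction-count table per delay bucket, same thresholds as above
edges = [-float("inf"), 5, 19, 60, 80, float("inf")]
parts = {}
for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
    subset = flat[(flat["DELAY"] > lo) & (flat["DELAY"] <= hi)]
    if not subset.empty:
        parts[i] = subset.pivot_table(index="USER1", columns="USER2",
                                      values="numberOfInteractions",
                                      aggfunc="sum")

# "Merge": stack the tables; level 0 of the index is the bucket number
merged = pd.concat(parts)
```

`merged.loc[(1, 'us5'), 'us1']` then gives the interactions between us5 and us1 with delay in (5, 19], and each bucket's table can be normalized by its column totals just as in your current code.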