I need a pandas solution for the next question. For example I have a table:
| id | element | dt |
| -------- | -------- | -------- |
| 1 | a | 22/04/22 |
| 2 | a | 22/04/22 |
| 1 | b | 27/04/22 |
| 1 | a | 23/04/22 |
| 3 | b | 22/04/22 |
| 1 | a | 22/04/22 |
| 1 | a | 22/04/22 |
| 3 | b | 23/04/22 |
| 3 | b | 25/04/22 |
| 1 | a | 27/04/22 |
| 1 | c | 26/04/22 |
| 1 | d | 26/04/22 |
| 1 | g | 25/04/22 |
| 1 | b | 27/04/22 |
I want to find the percentage of similar items that the user (id) sees in one week (but not counting on the same day). For example, for this table, the answer is:
| id | percentage |
| -------- | --------- |
| 1 | 0.3 |
| 2 | 0.0 |
| 3 | 1 |
for id 1 answer is 0.3, because if we are taking a window for one week (7 days) he has seen an element "a" 3 times at 22/04/22, 23/04/22 and 27/04/22. He seen "a" at 22/04/22 a few times, but we don't counting it. At he seen 10 times at total. So answer is 3/10 = 0.3
Tried to do it by myself, but don't know exactly how window in Pandas is working.
IIUC, you need to use a double groupby
. Once to compute the number of unique values and size of each group per id/element. Then a second one per ID to sum the counts greater than 1 and divide by the sum of sizes:
(df.groupby(['id', 'element'])['dt']
.agg(['nunique', 'size'])
.assign(nunique=lambda d: d['nunique'].where(d['nunique'].gt(1), 0))
.groupby(level='id').sum()
.eval('nunique/size')
.reset_index(name='percentage')
)
Output:
id percentage
0 1 0.3
1 2 0.0
2 3 1.0
If you want to include the week, relative to the first day:
(df.assign(week=df['dt'].sub(df['dt'].min()).dt.days.floordiv(7))
.groupby(['week', 'id', 'element'])['dt']
.agg(['nunique', 'size'])
.assign(nunique=lambda d: d['nunique'].where(d['nunique'].gt(1), 0))
.groupby(level=['week', 'id']).sum()
.eval('nunique/size')
.reset_index(name='percentage')
)
Alternative using a custom function:
def similar(df):
g = df.groupby('element')['dt']
count = g.nunique()
return count[count>1].sum()/g.size().sum()
out = (
df.assign(week=df['dt'].sub(df['dt'].min()).dt.days.floordiv(7))
.groupby(['week', 'id'])
.apply(similar)
.reset_index(name='percent')
)
Output:
week id percent
0 0 1 0.3
1 0 2 0.0
2 0 3 1.0