I'm a little confused on how to code this.
I have a dataset like this:
rules user_list event_time row_number
rule1 123,244,344 2024-09-20 1
rule1 125,346,421 2024-09-19 2
rule1 125,343,431 2024-09-18 3
rule2 125,344,423 2024-09-20 1
rule2 125,346,421 2024-09-19 2
rule3 125,348,331 2024-09-20 1
rule3 125,336,221 2024-09-19 2
data = {
'rules': ['rule1', 'rule1', 'rule1', 'rule2', 'rule2', 'rule3', 'rule3'],
'user_list': ['123,244,344', '125,346,421', '125,343,431', '125,344,423', '125,346,421', '125,348,331', '125,336,221'],
'event_time': ['2024-09-20', '2024-09-19', '2024-09-18', '2024-09-20', '2024-09-19', '2024-09-20', '2024-09-19'],
'row_number': [1, 2, 3, 1, 2, 1, 2]
}
data = pd.DataFrame(data)
data['event_time'] = pd.to_datetime(data['event_time'])
I am trying to build another column that counts/finds the number of user_ids from the latest rule rows (where row_number = 1) that are in other rows within the past day and where the rule is a different rule (so count duplicate users that fired on different rules within the past day).
the final table should look like this:
rules user_list event_time row_number dupe_users
rule1 123,244,344 2024-09-20 1 344
rule1 125,346,421 2024-09-19 2 125,125,346,421
rule1 125,343,431 2024-09-18 3 125
rule2 125,344,423 2024-09-20 1 125,344
rule2 125,346,421 2024-09-19 2 125,125,346,421
rule3 125,348,331 2024-09-20 1 125,125
rule3 125,336,221 2024-09-19 2 125,125
ex: user 344 was seen on rule1 2024-09-20 and on rule2 on 2024-09-20.
I'm not sure I get the full logic, but as far as I understand, you could use a groupby.transform
with a custom function and help from collections.Counter
:
from collections import Counter
def f(s):
cnts = [Counter(x.split(',')) for x in s]
ref = sum(cnts, start=Counter())
out = []
for x in cnts:
diff = ref-x
out.append(','.join(y for val in x for y in [val]*(diff[val])))
return out
data['dupe_users'] = data.groupby('event_time')['user_list'].transform(f)
Output:
rules user_list event_time row_number dupe_users
0 rule1 123,244,344 2024-09-20 1 344
1 rule1 125,346,421 2024-09-19 2 125,125,346,421
2 rule1 125,343,431 2024-09-18 3
3 rule2 125,344,423 2024-09-20 1 125,344
4 rule2 125,346,421 2024-09-19 2 125,125,346,421
5 rule3 125,348,331 2024-09-20 1 125
6 rule3 125,336,221 2024-09-19 2 125,125