Search code examples
pythonpandas

Find duplicates from pandas column of nested lists within previous rows with multiple conditions


I'm a little confused on how to code this.

I have a dataset like this:

rules   user_list       event_time          row_number
rule1   123,244,344     2024-09-20          1
rule1   125,346,421     2024-09-19          2
rule1   125,343,431     2024-09-18          3
rule2   125,344,423     2024-09-20          1
rule2   125,346,421     2024-09-19          2
rule3   125,348,331     2024-09-20          1
rule3   125,336,221     2024-09-19          2
reproducible df
data = {
    'rules': ['rule1', 'rule1', 'rule1', 'rule2', 'rule2', 'rule3', 'rule3'],
    'user_list': ['123,244,344', '125,346,421', '125,343,431', '125,344,423', '125,346,421', '125,348,331', '125,336,221'],
    'event_time': ['2024-09-20', '2024-09-19', '2024-09-18', '2024-09-20', '2024-09-19', '2024-09-20', '2024-09-19'],
    'row_number': [1, 2, 3, 1, 2, 1, 2]
}
data = pd.DataFrame(data)
data['event_time'] = pd.to_datetime(data['event_time'])

I am trying to build another column that counts/finds the number of user_ids from the latest rule rows (where row_number = 1) that are in other rows within the past day and where the rule is a different rule (so count duplicate users that fired on different rules within the past day).

the final table should look like this:

rules   user_list       event_time          row_number      dupe_users
rule1   123,244,344     2024-09-20          1               344
rule1   125,346,421     2024-09-19          2               125,125,346,421
rule1   125,343,431     2024-09-18          3               125
rule2   125,344,423     2024-09-20          1               125,344
rule2   125,346,421     2024-09-19          2               125,125,346,421
rule3   125,348,331     2024-09-20          1               125,125
rule3   125,336,221     2024-09-19          2               125,125

ex: user 344 was seen on rule1 2024-09-20 and on rule2 on 2024-09-20.


Solution

  • I'm not sure I get the full logic, but as far as I understand, you could use a groupby.transform with a custom function and help from collections.Counter:

    from collections import Counter
    
    def f(s):
        cnts = [Counter(x.split(',')) for x in s]
        ref = sum(cnts, start=Counter())
        out = []
        for x in cnts:
            diff = ref-x
            out.append(','.join(y for val in x for y in [val]*(diff[val])))
        return out
    
    data['dupe_users'] = data.groupby('event_time')['user_list'].transform(f)
    

    Output:

       rules    user_list event_time  row_number       dupe_users
    0  rule1  123,244,344 2024-09-20           1              344
    1  rule1  125,346,421 2024-09-19           2  125,125,346,421
    2  rule1  125,343,431 2024-09-18           3                 
    3  rule2  125,344,423 2024-09-20           1          125,344
    4  rule2  125,346,421 2024-09-19           2  125,125,346,421
    5  rule3  125,348,331 2024-09-20           1              125
    6  rule3  125,336,221 2024-09-19           2          125,125