Search code examples
pythonpandaspython-itertools

How to count paired occurence of items within subgroups in a pandas dataframe?


Let's say I have a dataframe such as this one:

import pandas as pd
data = {'group':  ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
        'item': ['Apple', 'Chocolate', 'Beans', 'Apple', 'Beans', 'Banana', 'Banana', 'Chocolate', 'Banana', 'Orange', 'Apple', 'Apple' ]
}

df = pd.DataFrame(data, columns = ['group','item'])
df

0   a   Apple
1   a   Chocolate
2   a   Beans
3   a   Apple
4   b   Beans
5   b   Banana
6   b   Banana
7   c   Chocolate
8   c   Banana
9   c   Orange
10  c   Apple
11  c   Apple
        

How one would count the total number of paired occurrence that are within groups?

I would like to get a table with each combination of items that arise in the groups and get their frequencies. If there are two same item under one group, i would like to keep the combinaison of the item with itself but only count it as one when matched with other items.

Ideally, I would like to get a dataframe with items name on both axis, rows and columns and have half the matrix with corresponding values.

A dataframe with combination in one column and value on another would do perfectly too.

The above example would translate into this:

apple - chocolate   2
apple - apple       2
apple - beans       1
apple - Orange      1
apple - Banana      1
chocolate - Beans   1
beans - banana      1
banana - banana     1
chocolate - banana  1
chocolate - orange  1
banana - orange     1

So I thought I would first use combination and counter without duplicates, then add the items matching themselves later on. The reason is to avoid counting several matches for multiple items combos under the same group. However, when I try this, some of the combination arise twice in different orders such as apple - chocolate and chocolate - apple. Here is my code:

from collections import Counter
from itertools import combinations
df = df.groupby('group').filter(lambda g: len(g) > 1).drop_duplicates(subset=['group', 'item'], keep="first")

result = df.groupby(['group']).agg(lambda g: list(set(combinations(g, 2))))

combos = pd.DataFrame(Counter(result.item.sum()).items(), columns=['combos', 'count'])
combos 

     combos                 count
0   (Apple, Beans)          1
1   (Apple, Chocolate)      1
2   (Chocolate, Beans)      1
3   (Beans, Banana)         1
4   (Chocolate, Orange)     1
5   (Orange, Apple)         1
6   (Banana, Orange)        1
7   (Banana, Apple)         1
8   (Chocolate, Banana)     1
9   (Chocolate, Apple)      1

Please, help me!


Solution

  • Minor tweaks to your code will solve your problem

    ## No need to drop to duplicates as you create a set of the combinations, so they won't be counted twice
    # df = df.groupby('group').filter(lambda g: len(g) > 1).drop_duplicates(subset=['group', 'item'], keep="first")
    
    ## Sorted g, so the problem with tuples that are ordered differently is solved
    result = df.groupby(['group']).agg(lambda g: list(set(combinations(sorted(g), 2))))
    
    combos = pd.DataFrame(Counter(result.item.sum()).items(), columns=['combos', 'count'])
    combos 
    
                     combos  count
    0        (Apple, Beans)      1
    1    (Beans, Chocolate)      1
    2        (Apple, Apple)      2
    3    (Apple, Chocolate)      2
    4      (Banana, Banana)      1
    5       (Banana, Beans)      1
    6       (Apple, Banana)      1
    7      (Banana, Orange)      1
    8       (Apple, Orange)      1
    9   (Chocolate, Orange)      1
    10  (Banana, Chocolate)      1
    
    

    Marked my changes with ##

    You can of course easily order the results afterwords using

    combos.sort_values('count', ascending=False)