python pandas dataframe dataset series

How can I find pairs and triplets of values in a pandas dataframe?


I have a pandas dataframe in Python with two columns, Classification and Individual.

I need to count how many times each pair and triplet of values appears, both with and without considering the order. As an example, take the following sample data:

import pandas as pd

data = {
    'Classification': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5],
    'Individual': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'A', 'A', 'B', 'B', 'B']
}
df = pd.DataFrame(data)

Now, I want to arrive at the following results:

Classification  ValueSeries  TimesClassification  PercentageClassification
1               AB           5                    1
2               AB           5                    1
3               AC           2                    0.4
3               AB           5                    1
3               ABC          3                    0.6
4               AB           5                    1
4               BC           2                    0.4
4               ABC          3                    0.6
5               AC           2                    0.4
5               AB           5                    1
5               ABC          3                    0.6

That is, for each value of Classification, the unordered pairs and triplets contained within it.


Solution

  • The exact logic behind your expected output is not fully clear, but you can use itertools to produce the combinations of Individual values within each Classification, then use value_counts and groupby.transform to compute the counts and percentages:

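    from itertools import chain, combinations

    # df is assumed to be the DataFrame built from the question's data,
    # e.g. df = pd.DataFrame(data)

    def powerset(s):
        # Keep the distinct values, then build every combination of size 2 or more.
        # Note: set iteration order is arbitrary, so the order inside each tuple may vary.
        s = set(s)
        return list(chain.from_iterable(combinations(s, r)
                                        for r in range(2, len(s)+1))
                   )

    # One row per (Classification, combination)
    out = df.groupby('Classification')['Individual'].agg(powerset).explode()

    out = (out
        .reset_index(name='ValueSeries')
        # Number of Classification groups in which each combination appears
        .merge(out.value_counts().rename('TimesClassification'),
               how='left',
               left_on='ValueSeries', right_index=True)
        # Share relative to the most frequent combination within the same Classification
        .assign(PercentageClassification=lambda d: d['TimesClassification']
                / d.groupby('Classification')['TimesClassification'].transform('max')
               )
    )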
    

    Output:

        Classification ValueSeries  TimesClassification  PercentageClassification
    0                1      (A, B)                    5                       1.0
    1                2      (A, B)                    5                       1.0
    2                3      (C, A)                    3                       0.6
    3                3      (C, B)                    3                       0.6
    4                3      (A, B)                    5                       1.0
    5                3   (C, A, B)                    3                       0.6
    6                4      (C, A)                    3                       0.6
    7                4      (C, B)                    3                       0.6
    8                4      (A, B)                    5                       1.0
    9                4   (C, A, B)                    3                       0.6
    10               5      (C, A)                    3                       0.6
    11               5      (C, B)                    3                       0.6
    12               5      (A, B)                    5                       1.0
    13               5   (C, A, B)                    3                       0.6
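
  • If you want labels such as AB and ABC, as in the question's table, a possible variant is to sort the distinct values before combining them and join each combination into a string. Sorting also keeps the same combination from being counted under two different element orders. This is only a sketch: the helper name sorted_subsets and its min_size parameter are made up for illustration, the join assumes the Individual values are strings, and it follows the same counting logic as above rather than the exact numbers in the question's expected table.

```python
from itertools import combinations

import pandas as pd

data = {
    'Classification': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
                       4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5],
    'Individual': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'C', 'C', 'C',
                   'A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B',
                   'C', 'C', 'C', 'C', 'C', 'A', 'A', 'B', 'B', 'B'],
}
df = pd.DataFrame(data)


def sorted_subsets(values, min_size=2):
    """Hypothetical helper: every subset of at least `min_size` distinct
    values, with sorted elements, joined into a string such as 'AB'.
    Assumes the values are strings."""
    unique = sorted(set(values))
    return [''.join(combo)
            for r in range(min_size, len(unique) + 1)
            for combo in combinations(unique, r)]


# One row per (Classification, subset label)
subsets = (df.groupby('Classification')['Individual']
             .agg(sorted_subsets)
             .explode())

# Number of Classification groups that contain each subset
counts = subsets.value_counts()

result = subsets.reset_index(name='ValueSeries')
result['TimesClassification'] = result['ValueSeries'].map(counts)
# Share relative to the most frequent subset within the same Classification
result['PercentageClassification'] = (
    result['TimesClassification']
    / result.groupby('Classification')['TimesClassification'].transform('max')
)

print(result)
```

    On the sample data this produces the same 14 rows as the output above, with AB counted 5 times and AC, BC and ABC counted 3 times each, but with ValueSeries shown as AB, AC, BC and ABC instead of tuples.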