Search code examples
pythonpandasnumpydataframefrequency

using groupby attribute in pandas


I have a DataFrame with 3 columns (A,B,C) and a large number of rows. There are different types of elements in each of these columns: A1,A2... B1,B2... and C1, C2... respectively.

I want to find the number of times a particular combination (say (A1,B2,C2)) occurs in a row. Then I want to generate a (summary) list of all combinations whose frequency-counts exceed a fixed threshold. Example: combinations number of counts (A1,B2,C2) 5 (A2,B2,C2) 7 .... if the fixed value is 6.

I am new to pandas and numpy. Can this be done efficiently using pandas, and if so how?


Solution

  • df = pd.DataFrame({'A':['A1','A1','A2','A3'],
                       'B':[4,4,6,4],
                       'C':[7,7,9,7]})
    
    print (df)
        A   B  C
    0   4   7  C1
    1   4   7  C1
    2   6   9  C2
    3   4   7  C3
    

    For count of all combination use groupby + size:

    s = df.groupby(["A", "B","C"]).size()
    print (s)
    A   B   C
    A1  4   7     2
    A2  6   9     1
    A3  4   7     1
    dtype: int64
    

    For filtered list by value add boolean indexing:

    L = s.index[s > 1].tolist()
    print (L)
    [('A1', 4, 7)]