Search code examples
pythonpandasgrouping

Column wise count in Pandas DF


I have this DF:

A B C D COLUMNS
1 ABC ABC 2,3
DEF ABC DEF 2,4
LSD LSD LSD 2,4
ERR ERR 2,3
ERR ABC LSD 2,4

Expected Output

A-1 B-2 C-3 D-4 F-5 COLUMNS INDEX ABC DEF ERR LSD
1 ABC ABC 2,3 2 0 0 0
DEF ABC DEF DEF 2,4 1 2 0 0
LSD LSD LSD LSD 2,3 0 0 0 2
ERR ERR 2,3 0 0 2 0
ERR ABC LSD LSD 2,4 1 0 1 1

I HAVE CREATED THE COLUMN INDEX FOR SUMMING BUT I COULDN'T ABLE TO LOOP EACH COLUMNS INDED AND GET THE INTENDED OUTPUT.


Solution

  • stack the relevant columns, then use a crosstab and join back to the original DataFrame:

    s = df.drop(columns=['A', 'COLUMNS']).stack().droplevel(1)
    
    out = df.join(pd.crosstab(s.index, s))
    

    If you want to use the boundaries from the "COLUMNS" column, you can combine it with to build a mask:

    import numpy as np
    
    # build the mask
    limits = df['COLUMNS'].str.split(',', expand=True).astype(int)
    a = np.arange(df.shape[1])
    m1 = (a+1>=limits[0].to_numpy()[:, None])
    m2 = (a<limits[1].to_numpy()[:, None])
    
    # apply the mask and stack
    s = df.where(m1 & m2).drop(columns=['A', 'COLUMNS']).stack().droplevel(1)
    
    # count
    out = df.join(pd.crosstab(s.index, s))
    

    Output:

         A    B    C    D COLUMNS  ABC  DEF  ERR  LSD
    0  1.0  ABC  ABC  NaN     2,3    2    0    0    0
    1  NaN  DEF  ABC  DEF     2,4    1    2    0    0
    2  NaN  LSD  LSD  LSD     2,4    0    0    0    3
    3  NaN  ERR  ERR  NaN     2,3    0    0    2    0
    4  NaN  ERR  ABC  LSD     2,4    1    0    1    1