
How to count the occurrences of elements in a list for each row in pandas


I have a df that looks like this. It is a multi-index df resulting from a groupby:

grouped = df.groupby(['chromosome', 'start_pos', 'end_pos',
                      'observed']).agg(lambda x: x.tolist())
                                          reference         zygosity    
chromosome  start_pos   end_pos observed                                            
chr1            69428   69428       G       [T, T]          [hom, hom]      
                69511   69511       G       [A, A]          [hom, hom]      
                762273  762273      A       [G, G, G]       [hom, het, hom] 
                762589  762589      C       [G]             [hom]       
                762592  762592      G       [C]             [het]       

For each row, I want to count the number of het and hom values in the zygosity column and store them in two new columns called 'count_hom' and 'count_het'.

I have tried using a for loop, but it is slow and not very reliable when the data changes. Is there a way to do this using something like df.zygosity.len().sum(), but counting only het or only hom?


Solution

  • Instead of post-processing the groupby result, you could adjust the groupby itself: pass lambdas to agg that count the "het" and "hom" values in each group at the time you build grouped:

    grouped = (df.groupby(['chromosome', 'start_pos', 'end_pos','observed'])
               .agg(reference=('reference', list), 
                    zygosity=('zygosity', list), 
                    count_het=('zygosity', lambda x: x.eq('het').sum()),
                    count_hom=('zygosity', lambda x: x.eq('hom').sum())))
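
To check the aggregation end to end, here is a minimal sketch on a small frame. The ungrouped rows are an assumption, reconstructed from the grouped output shown in the question; the real df may hold more columns.

```python
import pandas as pd

# Hypothetical ungrouped rows, reconstructed from the grouped output in the question.
df = pd.DataFrame({
    "chromosome": ["chr1"] * 9,
    "start_pos": [69428, 69428, 69511, 69511, 762273, 762273, 762273, 762589, 762592],
    "end_pos":   [69428, 69428, 69511, 69511, 762273, 762273, 762273, 762589, 762592],
    "observed":  ["G", "G", "G", "G", "A", "A", "A", "C", "G"],
    "reference": ["T", "T", "A", "A", "G", "G", "G", "G", "C"],
    "zygosity":  ["hom", "hom", "hom", "hom", "hom", "het", "hom", "hom", "het"],
})

keys = ["chromosome", "start_pos", "end_pos", "observed"]

# Named aggregation: each keyword becomes an output column,
# built from (input column, aggregation function).
grouped = df.groupby(keys).agg(
    reference=("reference", list),
    zygosity=("zygosity", list),
    # x is the Series of zygosity values for one group;
    # eq() gives a boolean mask and sum() counts the True entries.
    count_het=("zygosity", lambda x: x.eq("het").sum()),
    count_hom=("zygosity", lambda x: x.eq("hom").sum()),
)

print(grouped[["zygosity", "count_het", "count_hom"]])
```

Each group's counts should add up to the length of its zygosity list, as long as the column only contains "het" and "hom".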
    

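    If you want to collect columns into lists without naming each one, you could build the named aggregations with a dict comprehension. The example below does this for every column in df.columns except 'reference':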

    cols = ['chromosome', 'start_pos', 'end_pos','observed']
    out = df.groupby(cols).agg(**{c: (c, list) for c in df.columns.drop('reference')}, 
                               count_het=('zygosity', lambda x: x.eq('het').sum()),
                               count_hom=('zygosity', lambda x: x.eq('hom').sum()))
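
If the grouped frame already exists and rebuilding it is not convenient, the counts can also be derived from the list column directly. This is a sketch of an alternative, not part of the approach above: the small frame is a hypothetical stand-in for grouped, and it uses Python's list.count through Series.apply.

```python
import pandas as pd

# Hypothetical stand-in for the already grouped frame: one list per row.
grouped = pd.DataFrame({
    "zygosity": [["hom", "hom"], ["hom", "hom"], ["hom", "het", "hom"], ["hom"], ["het"]],
})

# list.count tallies how often one value appears inside each row's list.
grouped["count_het"] = grouped["zygosity"].apply(lambda calls: calls.count("het"))
grouped["count_hom"] = grouped["zygosity"].apply(lambda calls: calls.count("hom"))

print(grouped)
```

This still calls a Python function once per row, so on large data counting inside the groupby is likely the faster route.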