Search code examples
pythonpandasgroup-by

Pandas groupby: add suffix to elements which are identical across groups


I have a dataframe like this:

                                peakID cytoband      start        end    length  10.388_116  10.193_156  10.401_184  10.214_385
0  Amp_2q37.3_chr2:237990001-242193529   2q37.3  237990001  242193529   4203528           1           0           0           0
1  Del_2q37.3_chr2:226990001-242193529   2q37.3  226990001  242193529  15203528          -1           0           0           0

Notice how peakID is different, but cytoband is not. I need to unpivot this table (using a function from pyjanitor) without keeping peakID. Currently I do:

import pandas as pd
import pyjanitor
from natsort import natsort_keygen

table = (
        table
        .drop(columns="peakID")
       .pivot_longer(index=["cytoband", "start", "end", "length"],
                     names_to="sample", values_to="state")
            .sort_values(["cytoband", "sample"], key=natsort_keygen())
            .remove_columns(["length", "start", "end"])
            .set_index("cytoband")
        )

And the end result looks like this:

table.loc["2q37.3", :]
Out[36]: 
              sample  state
cytoband                   
2q37.3    10.193_156      0
2q37.3    10.193_156      0
2q37.3    10.214_385      0
2q37.3    10.214_385      0
2q37.3    10.388_116      1
2q37.3    10.388_116     -1
2q37.3    10.401_184      0
2q37.3    10.401_184      0

The problem lies in the fact that if cytoband is duplicated in different peakIDs, the resulting table will have the two records (state) for each sample mixed up (as they don't have the relevant unique ID anymore).

The idea would be to suffix the duplicate records across distinct peakIDs (e.g. "2q37.3_A", "2q37.3_B", but I'm not sure on how to do that with groupby or pandas in general as I need information from more than one group.

What's the cleanest solution to do this? Existing solutions (or this one) don't really fit.


Solution

  • Use a groupby.cumcount and a mapper:

    from string import ascii_uppercase
    letters = dict(enumerate(ascii_uppercase))
    
    table['sample'] += '_' + table.groupby(['cytoband', 'sample']).cumcount().map(letters)
    

    Output:

                    sample  state
    cytoband                     
    2q37.3    10.193_156_A      0
    2q37.3    10.193_156_B      0
    2q37.3    10.214_385_A      0
    2q37.3    10.214_385_B      0
    2q37.3    10.388_116_A      1
    2q37.3    10.388_116_B     -1
    2q37.3    10.401_184_A      0
    2q37.3    10.401_184_B      0
    

    Or, keep peakID:

    out = (
            table
            .pivot_longer(index=["peakID", "cytoband", "start", "end", "length"],
                         names_to="sample", values_to="state")
                .sort_values(["cytoband", "sample"], key=natsort_keygen())
                .remove_columns(["length", "start", "end"])
                .set_index(["cytoband", "peakID"])
            )
    

    Output:

                                                      sample  state
    cytoband peakID                                                
    2q37.3   Amp_2q37.3_chr2:237990001-242193529  10.193_156      0
             Del_2q37.3_chr2:226990001-242193529  10.193_156      0
             Amp_2q37.3_chr2:237990001-242193529  10.214_385      0
             Del_2q37.3_chr2:226990001-242193529  10.214_385      0
             Amp_2q37.3_chr2:237990001-242193529  10.388_116      1
             Del_2q37.3_chr2:226990001-242193529  10.388_116     -1
             Amp_2q37.3_chr2:237990001-242193529  10.401_184      0
             Del_2q37.3_chr2:226990001-242193529  10.401_184      0