I have a dataframe like this:
peakID cytoband start end length 10.388_116 10.193_156 10.401_184 10.214_385
0 Amp_2q37.3_chr2:237990001-242193529 2q37.3 237990001 242193529 4203528 1 0 0 0
1 Del_2q37.3_chr2:226990001-242193529 2q37.3 226990001 242193529 15203528 -1 0 0 0
Notice how peakID
is different, but cytoband
is not. I need to unpivot this table (using a function from pyjanitor
) without keeping peakID
. Currently I do:
import pandas as pd
import pyjanitor
from natsort import natsort_keygen
table = (
table
.drop(columns="peakID")
.pivot_longer(index=["cytoband", "start", "end", "length"],
names_to="sample", values_to="state")
.sort_values(["cytoband", "sample"], key=natsort_keygen())
.remove_columns(["length", "start", "end"])
.set_index("cytoband")
)
And the end result looks like this:
table.loc["2q37.3", :]
Out[36]:
sample state
cytoband
2q37.3 10.193_156 0
2q37.3 10.193_156 0
2q37.3 10.214_385 0
2q37.3 10.214_385 0
2q37.3 10.388_116 1
2q37.3 10.388_116 -1
2q37.3 10.401_184 0
2q37.3 10.401_184 0
The problem lies in the fact that if cytoband
is duplicated in different peakID
s, the resulting table will have the two records (state
) for each sample mixed up (as they don't have the relevant unique ID anymore).
The idea would be to suffix the duplicate records across distinct peakIDs (e.g. "2q37.3_A", "2q37.3_B", but I'm not sure on how to do that with groupby
or pandas in general as I need information from more than one group.
What's the cleanest solution to do this? Existing solutions (or this one) don't really fit.
Use a groupby.cumcount
and a mapper:
from string import ascii_uppercase
letters = dict(enumerate(ascii_uppercase))
table['sample'] += '_' + table.groupby(['cytoband', 'sample']).cumcount().map(letters)
Output:
sample state
cytoband
2q37.3 10.193_156_A 0
2q37.3 10.193_156_B 0
2q37.3 10.214_385_A 0
2q37.3 10.214_385_B 0
2q37.3 10.388_116_A 1
2q37.3 10.388_116_B -1
2q37.3 10.401_184_A 0
2q37.3 10.401_184_B 0
Or, keep peakID:
out = (
table
.pivot_longer(index=["peakID", "cytoband", "start", "end", "length"],
names_to="sample", values_to="state")
.sort_values(["cytoband", "sample"], key=natsort_keygen())
.remove_columns(["length", "start", "end"])
.set_index(["cytoband", "peakID"])
)
Output:
sample state
cytoband peakID
2q37.3 Amp_2q37.3_chr2:237990001-242193529 10.193_156 0
Del_2q37.3_chr2:226990001-242193529 10.193_156 0
Amp_2q37.3_chr2:237990001-242193529 10.214_385 0
Del_2q37.3_chr2:226990001-242193529 10.214_385 0
Amp_2q37.3_chr2:237990001-242193529 10.388_116 1
Del_2q37.3_chr2:226990001-242193529 10.388_116 -1
Amp_2q37.3_chr2:237990001-242193529 10.401_184 0
Del_2q37.3_chr2:226990001-242193529 10.401_184 0