I am trying to transform a CSV table with schema COL1=INT, COL2=STR, COL3=STR as follows:
Input sample
INT1, S1, S2
INT1, S1, S3
INT1, S2, S3
INT2, S1, S2
INT2, S1, S3
INT2, S1, S4
INT2, S2, S3
INT2, S2, S4
INT2, S3, S4
to an output CSV/table where each unique string from COL2 or COL3 appears once in COL1 of the output, followed (in COL2 of the output) by the list of unique COL1 INTs it appeared with.
Output CSV/table:
S1, [INT1, INT2]
S2, [INT1, INT2]
S3, [INT1, INT2]
S4, [INT2]
First, you can use melt to stack COL2 and COL3 vertically into long format. Then you can use groupby on the melted column and collect the unique COL1 values for each subgroup (S1, S2, etc.):
import pandas as pd

# create example df
df = pd.DataFrame([["INT1", "S1", "S5"],
                   ["INT1", "S2", "S3"],
                   ["INT2", "S1", "S2"],
                   ["INT2", "S1", "S3"]],
                  columns=["COL1", "COL2", "COL3"])
print(df)
COL1 COL2 COL3
0 INT1 S1 S5
1 INT1 S2 S3
2 INT2 S1 S2
3 INT2 S1 S3
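For reference, here is the intermediate long-format frame that melt produces from the same example df, before the groupby is applied. The `variable` column records which original column each string came from, and `value` holds the strings themselves:

```python
import pandas as pd

df = pd.DataFrame([["INT1", "S1", "S5"],
                   ["INT1", "S2", "S3"],
                   ["INT2", "S1", "S2"],
                   ["INT2", "S1", "S3"]],
                  columns=["COL1", "COL2", "COL3"])

# stack COL2 and COL3 into a single "value" column
melted = pd.melt(df, id_vars="COL1")
print(melted)
#    COL1 variable value
# 0  INT1     COL2    S1
# 1  INT1     COL2    S2
# 2  INT2     COL2    S1
# 3  INT2     COL2    S1
# 4  INT1     COL3    S5
# 5  INT1     COL3    S3
# 6  INT2     COL3    S2
# 7  INT2     COL3    S3
```

Grouping this frame by `value` is what collapses each string down to one row.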
# melt COL2/COL3 into one column, then collect the unique COL1 values per string
result = pd.melt(df, id_vars="COL1").groupby("value")["COL1"].unique()
print(result)
S1 [INT1, INT2]
S2 [INT1, INT2]
S3 [INT1, INT2]
S5 [INT1]
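If you also want the result written out in the requested `S, [INT1, INT2]` CSV shape, a minimal sketch is below (the output file name `out.csv` is my assumption; `result` holds NumPy arrays, so formatting each row manually avoids their default `['INT1' 'INT2']` rendering):

```python
import pandas as pd

df = pd.DataFrame([["INT1", "S1", "S5"],
                   ["INT1", "S2", "S3"],
                   ["INT2", "S1", "S2"],
                   ["INT2", "S1", "S3"]],
                  columns=["COL1", "COL2", "COL3"])
result = pd.melt(df, id_vars="COL1").groupby("value")["COL1"].unique()

# write one line per string, with its INTs in a bracketed, comma-separated list
# ("out.csv" is an assumed file name)
with open("out.csv", "w") as f:
    for value, ints in result.items():
        f.write(f"{value}, [{', '.join(ints)}]\n")
```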