Search code examples
pythoncsvpandashivepython-itertools

How to group unique COL1 values across unique COL2 & COL3 values from CSV


Seeking to transform a CSV table with schema COL1=INT, COL2=STR, COL3=STR as such:

Input sample

INT1, S1, S2
INT1, S1, S3
INT1, S2, S3
INT2, S1, S2
INT2, S1, S3
INT2, S1, S4
INT2, S2, S3
INT2, S2, S4
INT2, S3, S4

to an output CSV/table where unique strings in COL2 or COL3 are output as COL1 followed by a list (in COL2 of output table/CSV) of unique COL1 INTs where they appeared.

Output CSV/table:

S1, [INT1, INT2]
S2, [INT1, INT2]
S3, [INT1, INT2]
S4, [INT2]

Solution

  • First, you can use melt to vertically stack COL2 and COL3 into long format. Second, you can now use groupby to group by the melted column and get the unique values of COL1 for each sub group (S1, S2, etc...):

    # create example df
    df = pd.DataFrame([["INT1", "S1", "S5"], 
                       ["INT1", "S2", "S3"],
                       ["INT2", "S1", "S2"],
                       ["INT2", "S1", "S3"]],
                      columns=["COL1", "COL2", "COL3"])
    print(df)
    
        COL1    COL2    COL3
    0   INT1    S1      S5
    1   INT1    S2      S3
    2   INT2    S1      S2
    3   INT2    S1      S3
    
    result =  pd.melt(df, id_vars="COL1").groupby("value")["COL1"].unique()
    print(result)
    
    S1    [INT1, INT2]
    S2    [INT1, INT2]
    S3    [INT1, INT2]
    S5    [INT1]