How to label-encode comma separated text in a Dataframe column in Python?


I have a dataframe (df) that looks something like this:

Shape      Weight  Colour
Circle     5       Blue, Red
Square     7       Yellow, Red
Triangle   8       Blue, Yellow, Red
Rectangle  10      Green

I would like to label encode the "Colour" column so that the dataframe looks like this:

Shape      Weight  Blue  Red  Yellow  Green
Circle     5       1     1    0       0
Square     7       0     1    1       0
Triangle   8       1     1    1       0
Rectangle  10      0     0    0       1

Is there an easy function to do this type of conversion? Any pointers in the right direction would be appreciated. Thanks.


Solution

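  • Try splitting the column on commas, exploding it into one row per colour, and cross-tabulating the result: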

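    import pandas as pd

    # Split each "Colour" string into a list, ignoring spaces around the commas.
    df["Colour"] = df["Colour"].str.split(r"\s*,\s*", regex=True)
    # One row per (Shape, Colour) pair.
    x = df.explode("Colour")

    # Count each colour per shape, then attach the counts to the original rows.
    df_out = (
        pd.concat(
            [df.set_index("Shape"), pd.crosstab(x["Shape"], x["Colour"])], axis=1
        )
        .reset_index()
        .drop(columns="Colour")
    )
    print(df_out)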
    

    Prints:

           Shape  Weight  Blue  Green  Red  Yellow
    0     Circle       5     1      0    1       0
    1     Square       7     0      0    1       1
    2   Triangle       8     1      0    1       1
    3  Rectangle      10     0      1    0       0
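
As an alternative sketch (not part of the answer above), pandas also has `Series.str.get_dummies`, which builds the 0/1 indicator columns directly from a delimited string column. The example below rebuilds the sample dataframe from the question so that it runs on its own; the variable names `colours` and `dummies` are only illustrative. It assumes every row holds a non-empty, comma-separated string.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Shape": ["Circle", "Square", "Triangle", "Rectangle"],
        "Weight": [5, 7, 8, 10],
        "Colour": ["Blue, Red", "Yellow, Red", "Blue, Yellow, Red", "Green"],
    }
)

# Normalise the separator so stray spaces do not end up in the column names.
colours = df["Colour"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One indicator column per distinct colour.
dummies = colours.str.get_dummies(sep=",")

# Replace the original text column with the indicator columns.
df_out = df.drop(columns="Colour").join(dummies)
print(df_out)
```

This should give the same columns as the crosstab approach (Blue, Green, Red, Yellow) without modifying `df` in place. Unlike crosstab, it records presence rather than counts, so a colour repeated within one cell still yields 1.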