Search code examples
pythonpandasdataframeone-hot-encoding

Pandas convert a column containing strings into new columns


I have a dataframe with columns that contains comma separated strings. I would like to create new columns similar to what one hot encoding does.

Below is a very simplistic example. In my use case, I have thousands of rows with more columns, and two columns containing comma separated many strings. I could have used apply+lamda function+string contains condition to create each column but that is very tedious as it will be 100s of new columns

Input Datafarme

ColumnA    ColumnB
 1         {"alpha", "bravo"} 
 2         {"bravo", "charlie"}
 3         {"alpha", "charlie","gama"}
 4         {"bravo", "charlie","delta"}

Output dataframe

ColumnA    alpha     bravo   charlie delta gamma  
 1         1         1       0       0     0       
 2         0         1       0       0     0
 3         1         0       1       0     1
 4         0         1       1       1     0

edit: I should have mentioned that the data is read from csv file. Column B is is of type str. It contains many strings inside. it looks {"value1" , "value2"...."valueN"} here values could be of the format "XYZ" or "X Y Z" or "X_Y_Z". The example i have provided is very simplistic version


Solution

  • Assuming you really have string representations of sets, first remove the {"/"} and internal ", ", then use str.get_dummies:

    out = (df.drop(columns='ColumnB')
             .join(df['ColumnB'].str.replace('^{"|"(,)\s*"|"}$', r'\1', regex=True)
                                .str.get_dummies(sep=',')
                   )
            )
    

    Output:

       ColumnA  alpha  bravo  charlie  delta  gama
    0        1      1      1        0      0     0
    1        2      0      1        1      0     0
    2        3      1      0        1      0     1
    3        4      0      1        1      1     0
    

    Alternative

    similar to @PaulS's approach, but with ast.literal_eval:

    from ast import literal_eval
    s = df['ColumnB'].apply(literal_eval).explode()
    out = df.drop(columns='ColumnB').join(pd.crosstab(s.index, s))