Search code examples
pandasdataframedummy-variable

Creating dummy columns from cells with multiple values


I have a DF as shown below:

DF =
id  Result      
1   Li_In-AR-B, Or_Ba-AR-B
1   Li_In-AR-L, Or_Ba-AR-B
3   N
4   Lo_In-AR-U
5   Li_In-AR-U
6   Or_Ba-AR-B
6   Or_Ba-AR-L
7   N

Now I want to create new columns for every unique value in Result before the first "-". Every other value in the new column should be set to N. The delimiter "," is used to separate both instances in case of multiple values (2 or more).

DF =
id  Result        Li_In         Lo_In       Or_Ba
1   Li_In-AR-B    Li_In-AR-B    N           Or_Ba-AR-B
1   Li_In-AR-L    Li_In-AR-L    N           Or_Ba-AR-B
3   N             N             N           N
4   Lo_In-AR-U    N             Lo_In-AR-U  N
5   Li_In-AR-U    Li_In-AR-U    N           N
6   Or_Ba-AR-B    N             N           Or_Ba-AR-B
6   Or_Ba-AR-L    N             N           Or_Ba-AR-L
7   N             N             N           N

I thought I could do this easily using .get_dummies but this only returns a binary value for each cell.

DF_dummy = DF.Result.str.get_dummies(sep='-')
DF = pd.concat([DF,DF_dummy ],axis=1)

Also this solution for an earlier post is not applicable for the new case.

m = DF['Result'].str.split('-', n=1).str[0].str.get_dummies().drop('N', axis=1) == 1
df1 = pd.concat([DF['Result']] * len(m.columns), axis=1, keys=m.columns)

Any ideas?


Solution

  • Use dictionary comprehension with DataFrame constructor for split by ,\s+ for split by coma with one or more whitespaces.

    import re
    f = lambda x: {y.split('-', 1)[0] : y for y in re.split(',\s+', x) if y != 'N' } 
    df1 = pd.DataFrame(DF['Result'].apply(f).values.tolist(), index=DF.index).fillna('N')
    print (df1)
            Li_In       Lo_In       Or_Ba
    0  Li_In-AR-B           N  Or_Ba-AR-B
    1  Li_In-AR-L           N  Or_Ba-AR-B
    2           N           N           N
    3           N  Lo_In-AR-U           N
    4  Li_In-AR-U           N           N
    5           N           N  Or_Ba-AR-B
    6           N           N  Or_Ba-AR-L
    7           N           N           N
    

    Last add to original DataFrame:

    df = DF. join(df1) 
    print (df)
       id                  Result       Li_In       Lo_In       Or_Ba
    0   1  Li_In-AR-B, Or_Ba-AR-B  Li_In-AR-B           N  Or_Ba-AR-B
    1   1  Li_In-AR-L, Or_Ba-AR-B  Li_In-AR-L           N  Or_Ba-AR-B
    2   3                       N           N           N           N
    3   4              Lo_In-AR-U           N  Lo_In-AR-U           N
    4   5              Li_In-AR-U  Li_In-AR-U           N           N
    5   6              Or_Ba-AR-B           N           N  Or_Ba-AR-B
    6   6              Or_Ba-AR-L           N           N  Or_Ba-AR-L
    7   7                       N           N           N           N