Search code examples
pythonpandasdataframedummy-variable

split values of a dictionary into seperate panda dataframe columns -- make them dummy


Let's say we have a dataframe in this format:

id  properties
0   {"cat1":["p1","p2","p4"],"cat2":["p5", "p6"]}
1   {"cat1":["p3"],"cat2":["p7"]}

How can we convert it to this format?

id  p1    p2    p3    p4    p5    p6    p7
0   True  True  False True  True  True  False
1   False False True  False False False True

Keep in mind that the type of values in each cell is string. It has just two categories: cat1 and cat2


Solution

  • Set_index to id. As you said each cell is a string, so you need to convert df.properties from string of dict to dict by using ast.literal_eval. Next, and using str method to get cat1 and cat2 and combine their lists and explode them to rows and assign result to s. Finally, call pd.get_dummies on s with option dtype=bool and call max on level=0

    import ast
    
    df1 = df.set_index('id')
    df1.properties = df1.properties.map(ast.literal_eval)
    s = (df1.properties.str['cat1'] + df1.properties.str['cat2']).explode()
    pd.get_dummies(s, dtype=bool).max(level=0)
    
    Out[1035]:
           p1     p2     p3     p4     p5     p6     p7
    id
    0   True   True   False  True   True   True   False
    1   False  False  True   False  False  False  True