
How to create box plots from columns of dicts in pandas


  • I have a dataframe in which each cell of the model columns holds a dictionary, and I'd like to draw horizontal box plots from it with seaborn.

    1. The x axis should be the float values for each 'dialog'
    2. The y axis should show the 4 different models
    3. There should be a plot for each part of speech, meaning one graph for 'INTJ', another for 'ADV', and so on.
  • I'm thinking I'll have to do a pd.melt first to restructure the data so that the new columns would be 'dialog_num', 'model_type', and 'value' (the automatic variable name after a melt; it would hold the dictionaries).

  • After that, perhaps split the 'value' column so that each part of speech ('ADV', 'INTJ', 'VERB', etc.) becomes its own column (this part seems tricky to me). Past that point, would I loop over the columns and apply the horizontal boxplot to each?

import pandas as pd

pos =\
{'dialog_num': {0: 0, 1: 1, 2: 2},
 'model1': {0: {'ADV': 0.072, 'INTJ': 0.03, 'PRON': 0.133, 'VERB': 0.109},
            1: {'ADJ': 0.03, 'NOUN': 0.2, 'PRON': 0.13},
            2: {'ADV': 0.083, 'PRON': 0.125, 'VERB': 0.0625}},
 'model2': {0: {'ADJ': 0.1428, 'ADV': 0.1428, 'AUX': 0.1428, 'INTJ': 0.285},
            1: {'ADJ': 0.1, 'DET': 0.1, 'NOUN': 0.1, 'PROPN': 0.1, 'VERB': 0.2},
            2: {'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166, 'VERB': 0.3333}},
 'model3': {0: {'ADJ': 0.06, 'CCONJ': 0.06, 'NOUN': 0.2, 'PRON': 0.266, 'SPACE': 0.066, 'VERB': 0.333},
            1: {'AUX': 0.15, 'PRON': 0.25, 'PUNCT': 0.15, 'VERB': 0.15},
            2: {'ADP': 0.125, 'PRON': 0.0625, 'PUNCT': 0.0625, 'VERB': 0.25}},
 'model4': {0: {'ADJ': 0.25, 'ADV': 0.08, 'CCONJ': 0.083, 'PRON': 0.166},
            1: {'AUX': 0.33, 'PRON': 0.2, 'VERB': 0.0667},
            2: {'CCONJ': 0.125, 'NOUN': 0.125, 'PART': 0.125, 'PRON': 0.125, 'SPACE': 0.125, 'VERB': 0.375}}}
df = pd.DataFrame.from_dict(pos)
display(df)
   dialog_num                                                      model1                                                            model2                                                                                   model3                                                                                        model4
0           0  {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}      {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}  {'PRON': 0.266, 'VERB': 0.333, 'ADJ': 0.06, 'NOUN': 0.2, 'CCONJ': 0.06, 'SPACE': 0.066}                                     {'PRON': 0.166, 'ADV': 0.08, 'ADJ': 0.25, 'CCONJ': 0.083}
1           1                    {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}  {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}                                 {'PRON': 0.25, 'AUX': 0.15, 'VERB': 0.15, 'PUNCT': 0.15}                                                    {'PRON': 0.2, 'AUX': 0.33, 'VERB': 0.0667}
2           2               {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}   {'VERB': 0.3333, 'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166}                            {'PRON': 0.0625, 'VERB': 0.25, 'PUNCT': 0.0625, 'ADP': 0.125}  {'PRON': 0.125, 'VERB': 0.375, 'PART': 0.125, 'CCONJ': 0.125, 'NOUN': 0.125, 'SPACE': 0.125}

Solution

    • sns.boxplot expects data to be supplied in a long form when specifying x= and y=.
    • In this case, based on the specifications of having each speech type as a separate plot, sns.catplot will be used because there is a col= parameter, which can be used to create separate plots for speech types.
    1. As mentioned in the OP, use .melt to unpivot the wide dataframe.
    2. pd.json_normalize can be used to convert the 'value' column (dict type) into a flat table.
    3. Join the flattened table (vals) to dfm with .join.
      • This works because vals and dfm have matching indices.
    4. .melt the dataframe again.
    5. Plot the box plot from the long form dataframe.
    • Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
    import pandas as pd
    import seaborn as sns
    
    # load the dict into a dataframe
    df = pd.DataFrame(pos)
    
    # unpivot the dataframe
    dfm = df.melt(id_vars='dialog_num', var_name='model')
    
    # convert the 'value' column of dicts to a flat table
    vals = pd.json_normalize(dfm['value'])
    
    # combine vals to dfm, without the 'value' column
    dfm = dfm.iloc[:, 0:-1].join(vals)
    
    # unpivot the dataframe again
    dfm = dfm.melt(id_vars=['dialog_num', 'model'])
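
The join in step 3 relies on both frames carrying the same default 0..n-1 index. As a minimal sketch, assuming the melted frame has been filtered before flattening (the `subset` name and the small stand-in data are illustrative and not part of the original answer), the flattened table can be given the labels of the frame it is joined to:

```python
import pandas as pd

# Small stand-in for the melted frame: a column of dicts, as in step 1
dfm = pd.DataFrame({
    'dialog_num': [0, 1, 0, 1],
    'model': ['model1', 'model1', 'model2', 'model2'],
    'value': [{'ADV': 0.072}, {'NOUN': 0.2}, {'ADJ': 0.1428}, {'VERB': 0.2}],
})

# Hypothetical filtering step: the remaining rows keep index labels 2 and 3
subset = dfm[dfm['model'] == 'model2']

# Passing a plain list gives a result indexed 0..n-1
vals = pd.json_normalize(subset['value'].tolist())

# Joining as-is finds no matching labels, so the new columns are all NaN
misaligned = subset.drop(columns='value').join(vals)

# Reusing the labels of the left frame restores the row-by-row match
vals.index = subset.index
joined = subset.drop(columns='value').join(vals)
print(joined)
```

Calling `subset.reset_index(drop=True)` before flattening would be another way to get matching labels, at the cost of losing the original ones.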
    

    Plot all of the speech types together

    p = sns.boxplot(data=dfm, x='value', y='model')
    


    Plot speech types separately

    • For most combinations of model and speech type there is only a single value, or none at all, so many boxes collapse to a single line or are missing.
    p = sns.catplot(kind='box', data=dfm, x='value', y='model', col='variable', col_wrap=4, height=4)
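
To see how sparse the facets are before plotting, the non-missing values can be counted per speech type and model. This is a rough sketch on a small long-form stand-in (the `long_df`, `counts`, and `keep` names and the threshold of two values are assumptions made for illustration; the numbers are the 'INTJ' and 'PRON' entries of model1 from the question):

```python
import pandas as pd

# Long-form stand-in shaped like the frame after the second melt
long_df = pd.DataFrame({
    'dialog_num': [0, 1, 2, 0, 1, 2],
    'model': ['model1'] * 6,
    'variable': ['INTJ'] * 3 + ['PRON'] * 3,
    'value': [0.03, None, None, 0.133, 0.13, 0.125],
})

# Count the observations behind each box: rows are speech types, columns are models
counts = (
    long_df.dropna(subset=['value'])
    .groupby(['variable', 'model'])
    .size()
    .unstack(fill_value=0)
)
print(counts)

# Optionally keep only speech types where some model has at least two values
keep = counts.index[(counts >= 2).any(axis=1)]
filtered = long_df[long_df['variable'].isin(keep)]
```

Passing `filtered` instead of the full frame to `sns.catplot` would then drop the facets that cannot show a spread.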
    



    DataFrames at each step

    1: dfm.head()

       dialog_num   model                                                             value
    0           0  model1        {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}
    1           1  model1                          {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}
    2           2  model1                     {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}
    3           0  model2      {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}
    4           1  model2  {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}
    

    2: vals.head()

        INTJ     ADV   PRON    VERB     ADJ  NOUN     AUX  PROPN  DET  CCONJ  SPACE  PUNCT  ADP  PART
    0  0.030  0.0720  0.133  0.1090     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    1    NaN     NaN  0.130     NaN  0.0300   0.2     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    2    NaN  0.0830  0.125  0.0625     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    3  0.285  0.1428    NaN     NaN  0.1428   NaN  0.1428    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    4    NaN     NaN    NaN  0.2000  0.1000   0.1     NaN    0.1  0.1    NaN    NaN    NaN  NaN   NaN
    

    3: dfm.head()

       dialog_num   model   INTJ     ADV   PRON    VERB     ADJ  NOUN     AUX  PROPN  DET  CCONJ  SPACE  PUNCT  ADP  PART
    0           0  model1  0.030  0.0720  0.133  0.1090     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    1           1  model1    NaN     NaN  0.130     NaN  0.0300   0.2     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    2           2  model1    NaN  0.0830  0.125  0.0625     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    3           0  model2  0.285  0.1428    NaN     NaN  0.1428   NaN  0.1428    NaN  NaN    NaN    NaN    NaN  NaN   NaN
    4           1  model2    NaN     NaN    NaN  0.2000  0.1000   0.1     NaN    0.1  0.1    NaN    NaN    NaN  NaN   NaN
    

    4: dfm.head()

       dialog_num   model variable  value
    0           0  model1     INTJ  0.030
    1           1  model1     INTJ    NaN
    2           2  model1     INTJ    NaN
    3           0  model2     INTJ  0.285
    4           1  model2     INTJ    NaN