Tags: python, json, pandas, normalize

Normalize nested JSON data with Pandas/Python


I'm trying to normalize data similar to this sample:

{
  "2018-04-26 10:09:33": [
    {
      "user_id": "M8BE957ZA",
      "ts": "2018-04-26 10:06:33",
      "message": "Hello"
    }
  ],
  "2018-04-27 19:10:55": [
    {
      "user_id": "M5320QS1X",
      "ts": "2018-04-27 19:10:55",
      "message": "Thank you"
    }
  ],

I know I can use json_normalize(data, '2018-04-26 10:09:33', record_prefix='') to create a table in pandas, but the date/time key is different for every entry, so I can't hard-code it. How can I normalize the data so that I get the following? Any suggestions?

                          user_id         ts                    message

2018-04-26 10:09:33       M8BE957ZA       2018-04-26 10:06:33   Hello
2018-04-27 19:10:55       M5320QS1X       2018-04-27 19:10:55   Thank you

Solution

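    import pandas as pd

    test = {
      "2018-04-26 10:09:33": [
        {
          "user_id": "M8BE957ZA",
          "ts": "2018-04-26 10:06:33",
          "message": "Hello"
        }
      ],
      "2018-04-27 19:10:55": [
        {
          "user_id": "M5320QS1X",
          "ts": "2018-04-27 19:10:55",
          "message": "Thank you"
        }
      ]}
    # Each date key becomes a column; melt() reshapes the columns into
    # (variable, value) rows, one row per record.
    df = pd.DataFrame(test).melt()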
    
    
        variable            value
    0   2018-04-26 10:09:33 {'user_id': 'M8BE957ZA', 'ts': '2018-04-26 10:...
    1   2018-04-27 19:10:55 {'user_id': 'M5320QS1X', 'ts': '2018-04-27 19:...
    

    Read your dict into a DataFrame, then melt it to get the structure above. Next, apply json_normalize to the value column and join the result back to the variable column, like so:

    df.join(pd.json_normalize(df['value'].tolist())).drop(columns='value').rename(columns={'variable': 'date'})
    
        date                user_id     ts                  message
    0   2018-04-26 10:09:33 M8BE957ZA   2018-04-26 10:06:33 Hello
    1   2018-04-27 19:10:55 M5320QS1X   2018-04-27 19:10:55 Thank you
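
One caveat with the melt approach above: pd.DataFrame(test) only works when every date key holds a list of the same length, otherwise pandas raises a ValueError. If that might not hold for your data, a possible alternative is to flatten the dict yourself before handing it to pandas. This is only a sketch under a few assumptions: the variable names (data, rows, flat) are mine, the third record is a made-up example added purely to make the lists unequal, and it relies on pd.json_normalize, which is available at the top level in pandas 1.0 and later.

```python
import pandas as pd

data = {
    "2018-04-26 10:09:33": [
        {"user_id": "M8BE957ZA", "ts": "2018-04-26 10:06:33", "message": "Hello"},
    ],
    "2018-04-27 19:10:55": [
        {"user_id": "M5320QS1X", "ts": "2018-04-27 19:10:55", "message": "Thank you"},
        # Hypothetical extra record, not in the question's data; it is here
        # only so that the two lists have different lengths.
        {"user_id": "M5320QS1X", "ts": "2018-04-27 19:11:02", "message": "Bye"},
    ],
}

# Build one flat dict per record, carrying the outer key along as "date".
rows = [
    {"date": date, **record}
    for date, records in data.items()
    for record in records
]

# json_normalize would also expand nested dicts inside a record, if any.
flat = pd.json_normalize(rows)
print(flat)
```

Because the date is attached while iterating, no key has to be hard-coded, and the join/drop/rename step is not needed.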