Tags: python, pandas, normalization, json-normalize

Using json_normalize with pandas


I'm using the following code in Python to flatten the JSON structure below, but it doesn't work for all levels. I'm specifically interested in the tags.tags column data shown in the image below.

pd.json_normalize(data['conversations'])
pd.set_option('display.max_columns', None)

JSON structure:

[{'type': 'conversation.list',
  'pages': {'type': 'pages',
   'next': {'page': 3,
    'starting_after': 'WzE3MTU3ODIxOTc=='},
   'page': 2,
   'per_page': 1,
   'total_pages': 46969},
  'total_count': 46969,
  'conversations': [{'type': 'conversation',
    'id': '1384780',
    'created_at': 1715780970,
    'updated_at': 1715782197,
    'waiting_since': None,
    'snoozed_until': None,
    'source': {'type': 'conversation',
     'id': '2197597651',
     'delivered_as': 'customer_initiated',
     'subject': '',
     'body': '<p>Outros</p>',
     'author': {'type': 'user',
      'id': '64ac5cacccd2b047',
      'name': 'Claudiney Pinho',
      'email': '[email protected]'},
     'attachments': [],
     'url': None,
     'redacted': False},
    'contacts': {'type': 'contact.list',
     'contacts': [{'type': 'contact',
       'id': '64accccd71982047',
       'external_id': 'b363-cc8f--fb270b5e72e8'}]},
    'first_contact_reply': {'created_at': 1715780970,
     'type': 'conversation',
     'url': None},
    'admin_assignee_id': 5614527,
    'team_assignee_id': 5045796,
    'open': False,
    'state': 'closed',
    'read': True,
    'tags': {'type': 'tag.list',
     'tags': [{'type': 'tag',
       'id': '5379642',
       'name': '[BOT] Other',
       'applied_at': 1715781024,
       'applied_by': {'type': 'admin', 'id': '4685750'}},
      {'type': 'tag',
       'id': '5379660',
       'name': '[BOT] Connected Agent',
       'applied_at': 1715781025,
       'applied_by': {'type': 'admin', 'id': '4685750'}},
      {'type': 'tag',
       'id': '5379654',
       'name': '[BOT] Not Resolved',
       'applied_at': 1715781027,
       'applied_by': {'type': 'admin', 'id': '4685750'}},
      {'type': 'tag',
       'id': '7046337',
       'name': '[BOT] Portuguese',
       'applied_at': 1715781029,
       'applied_by': {'type': 'admin', 'id': '4685750'}}]},
    'priority': 'not_priority',
    'sla_applied': None,
    'statistics': {'type': 'conversation_statistics',
     'time_to_assignment': 0,
     'time_to_admin_reply': 189,
     'time_to_first_close': 1158,
     'time_to_last_close': 1228,
     'median_time_to_reply': 139,
     'first_contact_reply_at': 1715780970,
     'first_assignment_at': 1715780970,
     'first_admin_reply_at': 1715781159,
     'first_close_at': 1715782128,
     'last_assignment_at': 1715781159,
     'last_assignment_admin_reply_at': 1715781159,
     'last_contact_reply_at': 1715782179,
     'last_admin_reply_at': 1715782125,
     'last_close_at': 1715782198,
     'last_closed_by_id': 5614527,
     'count_reopens': 1,
     'count_assignments': 3,
     'count_conversation_parts': 28},
    'conversation_rating': None,
    'teammates': {'type': 'admin.list',
     'admins': [{'type': 'admin', 'id': '5614527'}]},
    'title': None,
    'custom_attributes': {'Language': 'Portuguese',
     'Conversation status': 'Open',
     'From': 'iOS / Android'},
    'topics': {'type': 'topic.list', 'topics': [], 'total_count': 0},
    'ticket': None,
    'linked_objects': {'type': 'list',
     'data': [],
     'total_count': 0,
     'has_more': False}}]}]

[Image: how the dataset is currently displayed]

I want to create a pandas DataFrame with the flattened key:value pairs as a table.


Solution

  • You can try the pd.json_normalize function with the record_path and meta parameters. It flattens the JSON down to the level you choose, and you can then extract any remaining nested lists or dictionaries into their own columns, including the tags.tags data you're interested in (see the sketch after the sample output below).

    I have provided a code snippet below. I'm not using Jupyter, so you may need to test it yourself, but in principle it should work, since the concept is sound.

    import pandas as pd
    
    df = pd.json_normalize(
        data[0]["conversations"],
        record_path=["tags", "tags"],  # one output row per entry in tags.tags
        meta=[  # conversation-level fields to carry along on each row
            "id",
            "created_at",
            "updated_at",
            ["source", "id"],
            ["source", "author", "name"],
        ],
        meta_prefix="meta_",  # avoid a clash with the tag-level "id" column
        errors="ignore",
    )
    pd.set_option("display.max_columns", None)
    

    Also note that your data contains many None/NaN values, which caused errors, and I had to use meta_prefix to avoid a clash with the tag-level id column.

      applied_by.id  meta_id meta_created_at meta_updated_at meta_source.id  \
    0       4685750  1384780      1715780970      1715782197            NaN   
    1       4685750  1384780      1715780970      1715782197            NaN   
    2       4685750  1384780      1715780970      1715782197            NaN   
    3       4685750  1384780      1715780970      1715782197            NaN   
    
      meta_source.author.name  
    0                     NaN  
    1                     NaN  
    2                     NaN  
    3                     NaN
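
    If you also want all the other flattened conversation columns alongside the tag data, here is a sketch of an alternative approach. It assumes data is the list shown in the question and that every conversation has at least one tag, as in the sample: normalize the conversations first, then explode the tags.tags column.

    import pandas as pd

    # Flatten each conversation; nested dicts become dotted columns,
    # but 'tags.tags' stays as a list of tag dicts per row.
    conv = pd.json_normalize(data[0]["conversations"])

    # Give each tag its own row, then turn the tag dicts into 'tag.*' columns.
    exploded = conv.explode("tags.tags").reset_index(drop=True)
    tag_cols = pd.json_normalize(exploded["tags.tags"].tolist()).add_prefix("tag.")

    # One row per (conversation, tag), keeping every flattened conversation column.
    flat = pd.concat([exploded.drop(columns=["tags.tags"]), tag_cols], axis=1)
    print(flat[["id", "tag.name", "tag.applied_at", "tag.applied_by.id"]])

    This avoids the meta_prefix workaround because the tag columns are prefixed with tag. instead. Note that a conversation with an empty tags.tags list would leave a NaN after explode, so drop those rows first if your data contains untagged conversations.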