Search code examples
jsonpython-3.xpandasdictionaryjmespath

How to crawl JSON dictionary object into pandas dataframe regardless if value missing in dictionary?


I am making a API call and getting my JSON data like so:

import requests 
import jmespath
import pandas as pd
import json

url = 'a.com'

r = requests.get(url).json()

The object returned looks like this:

{'question': [{
   'response': {'firstname': {'value': 'John'},
    'lastname': {'value': 'Bob'}},
   'profile_question': [{
     'identities': [{'type': 'ID,
       'value': '1'},
      {'type': 'EMAIL',
       'value': 'test@test.com'}]}]}]}

I tried putting this into json.fr but I get error that it is not correctly formed json. How every I can crawl this object as is but not successfully for what I need.

I am trying to use jmespath library to crawl and want to pull out four pieces of information firstname, lastname, ID, EMAIL like so and appending the data into list:

    lst =[]

    fname = jmespath.search('question[*].response.{firstname:firstname.value}',my_dict)
    lst.append(fname)

    lname = jmespath.search('question[*].response.{lastname:lastname.value}',my_dict)
    lst.append(lname)

    email_path = jmespath.search("question[*].profile_question[].identities.{email:[?type=='EMAIL'].value}",my_dict)
    lst.append(email)

    ID = jmespath.search("question[*].profile_question[].identities.{email:[?type=='ID'].value}",my_dict)
    lst.append(ID)

I append into a list in hopes of creating tuples per iteration that I can push into a dataframe.

The list looks like this:

[[{'firstname': 'John'}],
 [{'lastname': 'Bob'}],
 [{'email': ['test@test.com']}],
 [{'ID': ['1']}]]

However when I crawl the dictionary with missing values like so:

{'question': [{
   'response': {'firstname': {'value': 'John'},
    'lastname': {'value': 'Bob'}},
   'profile-question': [{
     'identities': [{'type': 'ID,
       'value': '1'},
      {'type': 'EMAIL',
       'value': 'test@test.com'}]}]}],
   'response': {'firstname': {'value': 'John1'},
    'lastname': {'value': 'Bob1'}},
   'profile-question': [{
     'identities': [{'type': 'ID,
       'value': '2'}]}]}

causes my list to behave like this (I can not tell why):

[[{'firstname': 'John'}], [{'email': ['test@test.com']}], [{'email': ['1']},[{'firstname': 'John'}],
 [{'lastname': 'Bob'}],
 [{'email': [][][]}],
 [{'ID': ['1']}]]]

which causes the df to look like this:

firstname         lastname      email                ID
john                 bob        test@test.com        1 
john1                bob1       test@test.com        1

How do I crawl a JSON dict object as it comes in from the API, pulling out four pieces of data firstname, lastname, email, ID and appending into a dataframe like so? :

firstname         lastname      email                    ID
    john                 bob        test@test.com        1 
    john1                bob1                            2

More than willing to get away from jmespath library, and to add, the above dictionary has many more fields, I have shortened so only the key points and their indentation is listed.


Solution

  • Well before anything the reason for the error is because the json object is missing quotes after ID.

     {'question': [{
       'response': {'firstname': {'value': 'John'},
        'lastname': {'value': 'Bob'}},
       'profile_question': [{
         'identities': [{'type': 'ID,
           'value': '1'},
          {'type': 'EMAIL',
           'value': 'test@test.com'}]}]}]}
    

    It should look like this:

    {'question': [{
       'response': {'firstname': {'value': 'John'},
        'lastname': {'value': 'Bob'}},
       'profile_question': [{
         'identities': [{'type': 'ID',
           'value': '1'},
          {'type': 'EMAIL',
           'value': 'test@test.com'}]}]}]}
    

    From here you can use the json library to turn the json object into a python dictionary object with json.loads(). Once you fixed the json object, your code can look something like this.

    import jmespath as jp
    import pandas as pd
    
    
    jon = {'question':
                  [{'response': {'firstname': {'value': 'John'},
                                 'lastname': {'value': 'Bob'}},
                    'profile_question': [{'identities': [{'type': 'ID',
                                                          'value': '1'},
                                                         {'type': 'EMAIL', 'value': 'test@test.com'}]}]}]}
    
    
    jsons = [jon] # list of all json objects
    df_list = []
    for json in jsons:
        try:
            fname = jp.search('question[*].response.firstname.value', jon)[0]
        except IndexError:
            fname = None
        try:
            lname = jp.search('question[*].response.lastname.value', jon)[0]
        except IndexError:
            lname = None
        try:
            email = jp.search("question[*].profile_question[].identities.{email:[?type=='EMAIL'].value}", jon)[0]['email'][0]
        except IndexError:
            email = None
        try:
            user_id = jp.search("question[*].profile_question[].identities.{email:[?type=='ID'].value}", jon)[0]['email'][0]
        except IndexError:
            user_id = None
        df_list.append(pd.DataFrame({'firstname': fname, 'lastname': lname, 'email': email, 'id': user_id}, index=[0]))
    
    
    df = pd.concat(df_list, ignore_index=True, sort=False)
    print(df)