Search code examples
pythonjsonpandasdataframenormalization

Read nested JSON into Pandas DataFrame


Background Info -
I have a JSON response from an API call, which I am trying to save in a pandas DataFrame, whilst maintaining the same structure, as when I view in a system I have called the data from.

Functions that calls JSON Response -
def api_call(): calls the API (Note: url_list only contains 1x url at present) and saves the response in the api_response variable, using json.loads(response.text)

def api_call():
    url_list = url_constructor()
    for url in url_list:
        response = requests.get(url_list[0], auth = HTTPBasicAuth(key, secret), headers={"Firm":"583"})
    api_response = json.loads(response.text)
    return api_response

Function that saves response to file and also returns it:
def response_writer(): saves api_response as a JSON file. It also returns api_response.

def response_writer():
    api_response = api_call()
    timestr = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M")
    filename = 'api_response_'+timestr+'.json'
    with open(filename, 'w') as output_data:
        json.dump(api_response, output_data)
        print("-------------------------------------------------------\n", 
              "API RESPONSE SAVED:", filename, "\n-------------------------------------------------------")
    return api_response

JSON Response -

{
  "meta": {
    "columns": [
      {
        "key": "node_id",
        "display_name": "Entity ID",
        "output_type": "Word"
      },
      {
        "key": "bottom_level_holding_account_number",
        "display_name": "Holding Account Number",
        "output_type": "Word"
      },
      {
        "key": "value",
        "display_name": "Adjusted Value (USD)",
        "output_type": "Number",
        "currency": "USD"
      },
      {
        "key": "node_ownership",
        "display_name": "% Ownership",
        "output_type": "Percent"
      },
      {
        "key": "model_type",
        "display_name": "Model Type",
        "output_type": "Word"
      },
      {
        "key": "valuation",
        "display_name": "Valuation (USD)",
        "output_type": "Number",
        "currency": "USD"
      },
      {
        "key": "_custom_jb_custodian_305769",
        "display_name": "JB Custodian",
        "output_type": "Word"
      },
      {
        "key": "top_level_owner",
        "display_name": "Top Level Owner",
        "output_type": "Word"
      },
      {
        "key": "top_level_legal_entity",
        "display_name": "Top Level Legal Entity",
        "output_type": "Word"
      },
      {
        "key": "direct_owner",
        "display_name": "Direct Owner",
        "output_type": "Word"
      },
      {
        "key": "online_status",
        "display_name": "Online Status",
        "output_type": "Word"
      },
      {
        "key": "financial_service",
        "display_name": "Financial Service",
        "output_type": "Word"
      },
      {
        "key": "_custom_placeholder_461415",
        "display_name": "Placeholder or Fee Basis",
        "output_type": "Boolean"
      },
      {
        "key": "_custom_close_date_411160",
        "display_name": "Account Close Date",
        "output_type": "Date"
      },
      {
        "key": "_custom_ownership_audit_note_425843",
        "display_name": "Ownership Audit Note",
        "output_type": "Word"
      }
    ],
    "groupings": [
      {
        "key": "holding_account",
        "display_name": "Holding Account"
      }
    ]
  },
  "data": {
    "type": "portfolio_views",
    "attributes": {
      "total": {
        "name": "Total",
        "columns": {
          "direct_owner": null,
          "node_ownership": null,
          "online_status": null,
          "_custom_ownership_audit_note_425843": null,
          "model_type": null,
          "_custom_placeholder_461415": null,
          "top_level_owner": null,
          "_custom_close_date_411160": null,
          "valuation": null,
          "bottom_level_holding_account_number": null,
          "_custom_jb_custodian_305769": null,
          "financial_service": null,
          "top_level_legal_entity": null,
          "value": null,
          "node_id": null
        },
        "children": [
          {
            "entity_id": 4754837,
            "name": "Apple Holdings Adv (748374923)",
            "grouping": "holding_account",
            "columns": {
              "direct_owner": "Apple Holdings LLC",
              "node_ownership": 1,
              "online_status": "Online",
              "_custom_ownership_audit_note_425843": null,
              "model_type": "Holding Account",
              "_custom_placeholder_461415": false,
              "top_level_owner": "Forsyth Family",
              "_custom_close_date_411160": null,
              "valuation": 10423695.609450001,
              "bottom_level_holding_account_number": "748374923",
              "_custom_jb_custodian_305769": "Laverockbank",
              "financial_service": "laverockbankcustodianservice",
              "top_level_legal_entity": "Apple Holdings LLC",
              "value": 10423695.609450001,
              "node_id": "4754837"
            },
          }
        ]
      }
    }
  },
  "included": []
}

Expected structure of JSON in Pandas DataFrame -
This is the structure I am trying to convey in my pandas DataFrame -

| Holding Account                 | Entity ID | Holding Account Number | Adjusted Value (USD) | % Ownership | Model Type      | Valuation (USD) | JB Custodian | Top Level Owner | Top Level Legal Entity          | Direct Owner                    | Online Status | Financial Service   | Placeholder or Fee Basis | Account Close Date | Ownership Audit Note |
|---------------------------------|-----------|------------------------|----------------------|-------------|-----------------|-----------------|--------------|-----------------|---------------------------------|---------------------------------|---------------|---------------------|--------------------------|--------------------|----------------------|
| Apple Holdings Adv (748374923)  | 4754837   | 748374923              | $10,423,695.06       | 100.00%     | Holding Account | $10,423,695.06  | BRF          | Forsyth Family  | Apple Holdings Partners LLC     | Apple Holdings Partners LLC     | Online        | custodianservice    | No                       | -                  | -                    |

My interpretation of the JSON Structure -
It looks like I need to concentrate on {'columns: (which has the column headers), and the 'children' (which represent rows of data, in my case, just 1x row) of 'data':. I can ignore 'groupings': [{'key': 'holding_account', 'display_name': 'Holding Account'}]},, as this ultimately is how the data sorted in the system.

Does anyone have advice on how I might take the JSON and load into a DataFrame with the demonstrated structure?

My interpretation is that I need to set the display_names [columns] as headers and then map the respective children values under each respective display_names / headers. Note: Ordinarily, there would be more children (representing each line of data for my DataFrame), however I have stripped all but 1x out, to make it easier to interpret.


Solution

  • I suggest using pd.json_normalize() ( https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html ) which helps transform JSON data into a pandas DataFrame.

    Note 1: Following I assume the data is available in a python dictionary called data. For testing purpose I used

    import json
    json_data = '''
    {
      "meta": {
          # ....
      },
      #...
      "included": []
    }
    '''
    data = json.loads(json_data)
    

    where json_data is your JSON response. As json.loads() doesn't accept a final comma, I omitted the comma after the children object.

    pd.json_normalize() offers different options. One possibility is to simply read all "children" data and then drop the columns that are not required. Also, after normalizing some columns have a prefix "columns." which needs to be removed.

    import pandas as pd
    df = pd.json_normalize(data['data']['attributes']['total']['children'])
    df.drop(columns=['grouping', 'entity_id'], inplace=True)
    df.columns = df.columns.str.replace(r'columns.', '')
    

    Finally, the columns names need to be replaced with those in the "columns" data:

    column_name_mapper = {column['key']: column['display_name'] for column in data['meta']['columns']}
    df.rename(columns=column_name_mapper, inplace=True)
    

    Note 2: There are some slight deviations from the expected structure you described. Most notably the word 'name' (with the row value "Apple Holdings Adv (748374923)") in the data frame header is not changed to 'Holding Account' as both terms are not found in the columns list. Some other values simply differ between the described JSON response and the expected structure.