Search code examples
pythonjsonpython-3.xpandasjson-normalize

How to import multiple json sections (all in one file) in to python/pandas


I'm trying to process a json file for use in another program that will use an excel file. My json file has multiple sections/arrays, one for some basic info like number of records, report name. It has sections/arrays for column names and then another that has each individual row/record as an array.

I've tried using pandas.read_json and the json_loads options to read the data in but I continually get errors. If I remove all sections except just one (like the rows) section, I can get it to read that in (although it puts all columns in one column like it's not identifying each column name separated by a comma as a separate column.

Ideally I'd like to not need to manually edit this file and just process it in python into a pandas dataframe so I can further manipulate it and export it for other use.

Any help would be greatly appreciated. This is my first post so if there's anything I can do better let me know!

Here's a representation of the json data, the actual data has several more columns and many more rows/records (700+ usually)

{
"count": 2,
"name": "report1",
"columnNames": [
    "Incident ID",
    "Status",
    "Subject"
],
"rows": [
    [
        "2460636",
        "Resolved",
        "login help"
    ],
    [
        "2460637",
        "Unresolved",
        "email help"
    ]
    ]
}

I'm trying to get the columnNames section to become the column names in a pandas dataframe and each "row" to be a record in the dataframe.

I've tried looking at other examples but I haven't come across a similar issue with json formatted like this.

I've tried using pandas.read_json("example.json") as well as the json.loads to load the data to get it in but they all come up with different errors that I can't seem to get around.

When Running the pandas.read_json("example.json") it comes back saying "arrays must all be same length".

Result should be the columnNames section/array should be the column names of the pandas dataframe and then each "row" I want to be a record in the dataframe.


Solution

  • Use pd.json_normalize: to unpack your json file

    pd.json_normalize

    import pandas as pd
    import json
    
    with open('test.json') as f:
        data = json.load(f)
    
    json_data = pd.json_normalize(data)
    

    Output:

                           columnNames  count  name                            rows
    0   [Incident ID, Status, Subject]  2   report1 [[2460636, Resolved, login help], [2460637, Un...
    

    Unpack rows:

    df_rows = pd.json_normalize(data, record_path=['rows'], meta=['name'])
    df_rows.rename({0: data['columnNames'][0],
                    1: data['columnNames'][1],
                    2: data['columnNames'][2]}, axis=1, inplace=True) 
    

    Output of df_row:

         Incident ID        Status       Subject       name
    0        2460636      Resolved    login help    report1
    1        2460637    Unresolved    email help    report1
    

    The json isn't particularly well formed, something like the following would be easier to unpack:

    {
        "count": 2,
        "name": "report1",
        "rows": [{
                "Incident ID": "2460636",
                "Status": "Resolved",
                "Subject": "login help"
            }, {
                "Incident ID": "2460637",
                "Status": "Unresolved",
                "Subject": "email help"
            }
        ]
    }