Search code examples
pythonjsonpandasjson-normalize

Is there a way to normalize a json pulled straight from an api


Here is the type of json file that I am working with

{
  "header": {
    "gtfsRealtimeVersion": "1.0",
    "incrementality": "FULL_DATASET",
    "timestamp": "1656447045"
  },
  "entity": [
    {
      "id": "RTVP:T:16763243",
      "isDeleted": false,
      "vehicle": {
        "trip": {
          "tripId": "16763243",
          "scheduleRelationship": "SCHEDULED"
        },
        "position": {
          "latitude": 33.497833,
          "longitude": -112.07365,
          "bearing": 0.0,
          "odometer": 16512.0,
          "speed": 1.78816
        },
        "currentStopSequence": 16,
        "currentStatus": "INCOMING_AT",
        "timestamp": "1656447033",
        "stopId": "2792",
        "vehicle": {
          "id": "5074"
        }
      }
    },
    {
      "id": "RTVP:T:16763242",
      "isDeleted": false,
      "vehicle": {
        "trip": {
          "tripId": "16763242",
          "scheduleRelationship": "SCHEDULED"
        },
        "position": {
          "latitude": 33.562374,
          "longitude": -112.07392,
          "bearing": 359.0,
          "odometer": 40367.0,
          "speed": 15.6464
        },
        "currentStopSequence": 36,
        "currentStatus": "INCOMING_AT",
        "timestamp": "1656447024",
        "stopId": "2794",
        "vehicle": {
          "id": "5251"
        }
      }
    }
  ]
}

In my code, I am taking in the json as a string. But when I try normalize json string to put into data frame

import pandas as pd
import json
import requests


base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()
packages_str = json.dumps(packages_json, indent=1)

df = pd.json_normalize(packages_str)

I get this error, I am definitely making some rookie error, but how exactly am I using this wrong? Are there additional arguments that may need in that?

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-33-aa23f9157eac> in <module>()
      8 packages_str = json.dumps(packages_json, indent=1)
      9 
---> 10 df = pd.json_normalize(packages_str)

/usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    421         data = list(data)
    422     else:
--> 423         raise NotImplementedError
    424 
    425     # check to see if a simple recursive function is possible to

NotImplementedError: 

When I had the json format within my code without the header portion referenced as an object, the pd.json_normalize(package_str) does work, why would that be, and what additional things would I need to do?


Solution

  • The issue is, that pandas.json_normalize expects either a dictionary or a list of dictionaries but json.dumps returns a string.

    It should work if you skip the json.dumps and directly input the json to the normalizer, like this:

    import pandas as pd
    import json
    import requests
    
    
    base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
    packages_json = base_URL.json()
    
    df = pd.json_normalize(packages_json) 
    

    If you take a look at the corresponding source-code of pandas you can see for yourself:

    if isinstance(data, list) and not data:
        return DataFrame()
    elif isinstance(data, dict):
        # A bit of a hackjob
        data = [data]
    elif isinstance(data, abc.Iterable) and not isinstance(data, str):
        # GH35923 Fix pd.json_normalize to not skip the first element of a
        # generator input
        data = list(data)
    else:
        raise NotImplementedError
    

    You should find this code at the path that is shown in the stacktrace, with the error raised on line 423: /usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py

    I would advise you to use a code-linter or an IDE that has one included (like PyCharm for example) as this is the type of error that doesn't happen if you have one.