Search code examples
pythonjsonpandastypeerrorjson-normalize

Partly flatten nested JSON objects stored in df column


I have a dataframe with 3 columns (INSTNR, Enhedsadresser, API_response), where the 3rd column (API_response) contains JSON objects. I would like to flatten the JSON object and store the extracted information in separate columns within the same df. I am particularly interested in extracting kategori, resultater -> adresse -> id, and resultater -> adresse -> adgangsadresseid information.

I have tried:

data = json_normalize(data=df['API_response'], record_path='resultater',
                            meta=['kategori'], errors='ignore')

but it simply returns TypeError: string indices must be integers

Whereas data = json_normalize(data=df['API_response']) gave me a column with a list of indices...

How can I extract the needed information?

Example of a JSON object:

{
  "kategori": "A",
  "resultater": [
    {
      "adresse": {
        "id": "0a3f50bc-f815-32b8-e044-0003ba298018",
        "vejnavn": "Staldgaardsgade",
        "adresseringsvejnavn": "Staldgaardsgade",
        "husnr": "39A",
        "supplerendebynavn": null,
        "postnr": "7100",
        "postnrnavn": "Vejle",
        "status": 1,
        "virkningstart": "2009-11-24T02:15:25.000Z",
        "virkningslut": null,
        "adgangsadresseid": "0a3f5090-edef-32b8-e044-0003ba298018",
        "etage": "st",
        "dør": "th",
        "href": "https://api.dataforsyningen.dk/adresser/0a3f50bc-f815-32b8-e044-0003ba298018"
      },
      "aktueladresse": {
        "id": "0a3f50bc-f815-32b8-e044-0003ba298018",
        "vejnavn": "Staldgaardsgade",
        "adresseringsvejnavn": "Staldgaardsgade",
        "husnr": "39A",
        "supplerendebynavn": null,
        "postnr": "7100",
        "postnrnavn": "Vejle",
        "status": 1,
        "virkningstart": "2009-11-24T02:15:25.000Z",
        "virkningslut": null,
        "adgangsadresseid": "0a3f5090-edef-32b8-e044-0003ba298018",
        "etage": "st",
        "dør": "th",
        "href": "https://api.dataforsyningen.dk/adresser/0a3f50bc-f815-32b8-e044-0003ba298018"
      },
      "vaskeresultat": {
        "variant": {
          "vejnavn": "Staldgaardsgade",
          "husnr": "39A",
          "etage": "st",
          "dør": "th",
          "supplerendebynavn": null,
          "postnr": "7100",
          "postnrnavn": "Vejle"
        },
        "afstand": 0,
        "forskelle": {
          "vejnavn": 0,
          "husnr": 0,
          "postnr": 0,
          "postnrnavn": 0,
          "etage": 0,
          "dør": 0
        },
        "parsetadresse": {
          "vejnavn": "Staldgaardsgade",
          "husnr": "39A",
          "etage": "st",
          "dør": "th",
          "postnr": "7100",
          "postnrnavn": "Vejle"
        },
        "ukendtetokens": [],
        "anvendtstormodtagerpostnummer": null
      }
    }
  ]
}

Link to API response containing this JSON object: https://api.dataforsyningen.dk/datavask/adresser?betegnelse=Staldgaardsgade%2039A%20st%20th,%207100%20Vejle

EDIT 1

I created GitHub repo with data and python script: https://github.com/mantasbacys/TREFOR


Solution

  • Series.str.get can help you extract individual elements from a dictionary column.

    If API_response is a column of dictionaries formatted as your example, you can use:

    df['kategori'] = df['api_response'].str.get('kategori')
    df['id'] = (
        df['api_response']
        .str.get('resultater')  # get resultater object
        .str.get(0)  # get first element in list
        .str.get('adresse')
        .str.get('id')
    )
    df['adgangsadresseid'] =  = (
        df['api_response']
        .str.get('resultater')  # get resultater object
        .str.get(0)  # get first element in list
        .str.get('adresse')
        .str.get('adgangsadresseid')
    )
    

    Note that this will fail if the format doesn't match, so double check! E.g. if there are multiple elements in the resultater list, you may want to use Series.explode instead.


    EDIT:

    Your file shows that API_response column actually contains parsed JSON dictionaries saved as strings. This means .str.get methods do not work as the values are not valid dictionaries.

    As a quick fix, you should convert the dictionary strings to actual dictionaries first:

    def convert(s): 
        """Convert dictionary string to JSON-like format and parse with json module"""
        s = s.replace("'", '"')
        s = s.replace("None", "null")
        return json.loads(s)
    
    df['api_response'] = df.API_response.apply(convert)
    # Alternative - UNSAFE! (see https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice)
    # df['api_response'] = df.API_response.apply(eval)
    

    I am assuming this is happening because you first download the data, then you parse it into a dictionary, and save in a file before you load it again. This conversion to a string column happens as you save the file.

    To avoid this conversion to string and back, try parsing the JSON and extracting values of interest in a single script.