Search code examples
pythonjsonpandasnormalize

How to normalize json file containing a list (that should be kept as a list) in Python | Pandas?


I'm trying to use the json_normalize function to convert a json file into a dataframe.

Source JSON

  • The json is a list of dictionaries that look something like this:

    {
          "sport_key": "basketball_ncaab",
          "sport_nice": "NCAAB",
          "teams": [
              "Bryant Bulldogs",
              "Wagner Seahawks"
          ],
          "commence_time": 1608152400,
          "home_team": "Bryant Bulldogs",
          "sites": [
              {
                  "site_key": "marathonbet",
                  "site_nice": "Marathon Bet",
                  "last_update": 1608156452,
                  "odds": {
                      "h2h": [
                          1.28,
                          3.54
                      ]
                  }
              },
              {
                  "site_key": "sport888",
                  "site_nice": "888sport",
                  "last_update": 1608156452,
                  "odds": {
                      "h2h": [
                          1.13,
                          5.8
                      ]
                  }
              },
              {
                  "site_key": "unibet",
                  "site_nice": "Unibet",
                  "last_update": 1608156434,
                  "odds": {
                      "h2h": [
                          1.13,
                          5.8
                      ]
                  }
              }
          ],
          "sites_count": 3
      }
    

The problem is that one of the future columns contains a list (which should be the case), but including this column in the meta part of the json_normalize function throws the following error:

ValueError: operands could not be broadcast together with shape (22,) (11,)

The error appears when I try to add "teams" in the list in the following code:

pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])

Solution

  • Assuming data is a list of dictionaries, you can still use json_normalize but you have to assign the teams column seperately for each corresponding dictionary in data:

    def normalize(d):
        return pd.json_normalize(d, 'sites', ['sport_key', 'sport_nice', 'home_team'])\
               .assign(teams=[d['teams']]*len(d['sites']))
    
    
    df = pd.concat([normalize(d) for d in data], ignore_index=True)
    

    Alternatively you can try:

    data = [{**d, 'teams': ','.join(d['teams'])} for d in data]
    df = pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])
    df['teams'] = df['teams'].str.split(',')
    

    Result:

          site_key     site_nice  last_update      odds.h2h         sport_key sport_nice        home_team                               teams
    0  marathonbet  Marathon Bet   1608156452  [1.28, 3.54]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
    1     sport888      888sport   1608156452   [1.13, 5.8]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
    2       unibet        Unibet   1608156434   [1.13, 5.8]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]