How to normalize json file containing a list (that should be kept as a list) in Python | Pandas?

I'm trying to use the json_normalize function to convert a json file into a dataframe.

Source JSON

The json is a list of dictionaries that look something like this:

{
      "sport_key": "basketball_ncaab",
      "sport_nice": "NCAAB",
      "teams": [
          "Bryant Bulldogs",
          "Wagner Seahawks"
      ],
      "commence_time": 1608152400,
      "home_team": "Bryant Bulldogs",
      "sites": [
          {
              "site_key": "marathonbet",
              "site_nice": "Marathon Bet",
              "last_update": 1608156452,
              "odds": {
                  "h2h": [
                      1.28,
                      3.54
                  ]
              }
          },
          {
              "site_key": "sport888",
              "site_nice": "888sport",
              "last_update": 1608156452,
              "odds": {
                  "h2h": [
                      1.13,
                      5.8
                  ]
              }
          },
          {
              "site_key": "unibet",
              "site_nice": "Unibet",
              "last_update": 1608156434,
              "odds": {
                  "h2h": [
                      1.13,
                      5.8
                  ]
              }
          }
      ],
      "sites_count": 3
  }

The problem is that one of the future columns contains a list (which should be the case), but including this column in the meta part of the json_normalize function throws the following error:

ValueError: operands could not be broadcast together with shape (22,) (11,)

The error appears when I try to add "teams" in the list in the following code:

pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])

Solution

Assuming data is a list of dictionaries, you can still use json_normalize but you have to assign the teams column seperately for each corresponding dictionary in data:

def normalize(d):
    return pd.json_normalize(d, 'sites', ['sport_key', 'sport_nice', 'home_team'])\
           .assign(teams=[d['teams']]*len(d['sites']))


df = pd.concat([normalize(d) for d in data], ignore_index=True)

Alternatively you can try:

data = [{**d, 'teams': ','.join(d['teams'])} for d in data]
df = pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])
df['teams'] = df['teams'].str.split(',')

Result:

      site_key     site_nice  last_update      odds.h2h         sport_key sport_nice        home_team                               teams
0  marathonbet  Marathon Bet   1608156452  [1.28, 3.54]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
1     sport888      888sport   1608156452   [1.13, 5.8]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
2       unibet        Unibet   1608156434   [1.13, 5.8]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]