Search code examples
pythonjsonpandasjson-normalize

Parsing nested json into pandas DataFrame


I am reading data from JSON files that represent football games of tournaments: tournament_1.json, tournament_2.json, ... the files are of the form

{
"name": "tournament_7",
"start_date": "2022-05-01",
"end_date": "2022-06-21",
"games": {  
    "tour_1": [
        {
            "home": "FC Kiruk",
            "away": "BAM united",
            "date": "2022-05-05",
            "result": "home_win",
            "id": "tournament_7_1"
        },
        {
            "home": "River Plays",
            "away": "All Casabianca",
            "date": "2022-05-06",
            "result": "home_win",
            "id": "tournament_7_2"
        },
        {
            "home": "River Plays",
            "away": "FC Kiruk",
            "date": "2022-05-10",
            "result": "away_win",
            "id": "tournament_7_3"
        },
        {
            "home": "BAM United",
            "away": "All Casabianca",
            "date": "2022-05-10",
            "result": "draw",
            "id": "tournament_7_4"
        },
        {
            "home": "FC Kiruk",
            "away": "All Casabianca",
            "date": "2022-05-17",
            "result": "draw",
            "id": "tournament_7_5"
        },
        {
            "home": "River Plays",
            "away": "BAM United",
            "date": "2022-05-18",
            "result": "away_win",
            "id": "tournament_7_6"
        }
    ],
    "tour_2": [
        {
            "home": "FC Kiruk",
            "away": "BAM United",
            "date": "2022-05-22",
            "result": "draw",
            "id": "tournament_7_7"
        },
        {
            "home": "River Plays",
            "away": "FC Kiruk",
            "date": "2022-05-26",
            "result": "home_win",
            "id": "tournament_7_8"
        },
        {
            "home": "BAM United",
            "away": "River Plays",
            "date": "2022-06-04",
            "result": "home_win",
            "id": "tournament_7_9"
        }
    ],
    "tour_3": [
        {
            "home": "FC Kiruk",
            "away": "BAM United",
            "date": "2022-06-12",
            "result": "draw",
            "id": "tournament_7_10"
        },
        {
            "home": "BAM United",
            "away": "FC Kiruk",
            "date": "2022-06-20",
            "result": "home_win",
            "id": "tournament_7_11"
        }
    ]
},
"tours": 3,
"type": "round_robin_all",
"winner": "BAM United"
}

What I did so far is read the JSON file using pandas.normalize_json with

import json
import pandas as pd

with open('tournament_7.json') as data_file:    
    data = json.load(data_file) 
 
df = pd.json_normalize(data, record_path=['games', 'tour_1'], meta=['name','start_date', 'end_date', 'tours', 'type', 'winner'])

which of course allows me to read data about tour_1

I am asking if there is a way to read all of it at once (I need to aggregate all the data frames afterward: read all JSON files ~ 1000s)


Solution

  • You haven't specified exactly what is the output that you are looking for. Here is how I've done it:

    import json
    import pandas as pd
    
    with open('tournament_7.json') as data_file:
        data = json.load(data_file)
    
    df = pd.DataFrame(columns=['name','start_date', 'end_date', 'tours', 'type', 'winner'])
    
    for tour in data["games"]:
        df = pd.concat([df,(pd.json_normalize(data, record_path=['games', tour], meta=['name','start_date', 'end_date', 'tours', 'type', 'winner']))], ignore_index = True, axis = 0)
    
    print(df)
    

    So I just loop through the different tours that are present in the games dictionary. Then I concat the resultant DataFrame. You may want to add a column that specifies which tour this row is for but that is up to you.