Search code examples
pythonjsoncsvtype-conversion

Converting large (140MB) JSON file into CSV in Python


I've been trying to do a data wrangling exercise with large files and I chose this big file, I googled around and saw that it is fairly straightforward in Python, but I'm not the most well versed with coding in yet. So I'd need a bit more explanation (Explain like I'm five) on how to set it up properly. I've done the 'saving file with another name to format it' but that is how it got formatted. This is how it is formatted as

I'm not sure if it is supposed to look this way from the python codes I found, so, if anyone could help or clarify it to me. This is a code I found.

import json
import csv
 
with open('G:\Akhil\jsonoutput.json') as json_file:
    jsondata = json.load(json_file)
 
data_file = open('G:\Akhil\jsonoutput.csv', 'w', newline='')
csv_writer = csv.writer(data_file)
 
count = 0
for data in jsondata:
    if count == 0:
        header = data.keys()
        csv_writer.writerow(header)
        count += 1
    csv_writer.writerow(data.values())
 
data_file.close()

Is that code correct? If yes how do I edit it to convert it? How do I download and save that file so I can open it on excel? Thanks in advance

This is a new thing to me so I'm trying to figure it out, but I'm stuck in what way is the correct one, as I'm not even sure the JSON file is formatted correctly


Solution

  • When converting anything (JSON, XML, ...) to CSV you need to have a clear idea of what you want the final CSV to look like.

    For JSON that's just a list of flat objects:

    [
        {"id": "1", "name": "foo", "score": 0},
        {"id": "2", "name": "bar", "score": 0},
        {"id": "1", "name": "foo", "score": 1},
        {"id": "3", "name": "baz", "score": 0},
        {"id": "3", "name": "baz", "score": 2}
    ]
    

    We can visualize that easily as this CSV:

    id,name,score
    1,foo,0
    2,bar,0
    1,foo,1
    3,baz,0
    3,baz,2
    

    and the Python code to do the transform:

    import csv
    import json
    
    with open("input_flat.json") as f:
        data = json.load(f)
    
    with open("output_flat.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        for row in data:
            writer.writerow(row)
    

    Since the JSON was just an array of objects, or in Python terms, a list of dicts, we can use the DictWriter to write each individual dict (row) in data. We also have to create the DictWriter with the fieldnames it will be looking for in the individual dicts, so data[0].keys() to get the keys from the first dict in the data.

    For JSON a little more like yours, where we have a nested structure, kind of like:

    {
        "1": {
            "name": "foo",
            "data": [
                {"score": 0},
                {"score": 1}
            ]
        },
        "2": {
            "name": "bar",
            "data": [
                {"score": 0}
            ]
        },
        "3": {
            "name": "baz",
            "data": [
                {"score": 0},
                {"score": 2}
            ]
        }
    }
    

    Should the CSV look similar to the CSV from above? (only you, dear programmer, can answer that question) If so, we can still use the DictWriter like above, but now the row-dict isn't handed to us, we need to build it ourselves:

    with open("input_nested.json") as f:
        all_data = json.load(f)
    
    rows = []
    for id_, obj in all_data.items():
        for data in obj["data"]:
            row = {
                "id": id_,
                "name": obj["name"],
                "score": data["score"],
            }
    
            rows.append(row)
    
    with open("output_nested.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
    

    I renamed data to all_data when I loaded the JSON. Next, we flatten each object in all_data, and its sub-objects (under the "data" key), down to a single row; or, "build up" the row-dict from the bottom of the structure. (I also had to be careful and name my variable in the outer loop id_ because id is resevered word in Python.) And that gives us:

    id,name,score
    1,foo,0
    1,foo,1
    2,bar,0
    3,baz,0
    3,baz,2
    

    Not identical to the one above, because of sorting, but pretty much equal.

    I wasn't able to find the actual JSON you used, despite searching OWID for about 20 minutes, but from your screenshot these structures are very similar. If you can follow along with the simple foo-bar-baz examples and understand the structure of the nested JSON, and then how to work with it in Python, you can start designing your final, desired CSV.