Search code examples
pythonjsonsqlite

How do I create sqlite3 table with a dict of json objects where the keys may not match on every item?


I am currently trying to create a series of tables. The first two tables populate fine as every entry has keys that match what the table was created with.

First table: 1 row containing 3 columns (object, has_more, data)

Second table: handful of rows with 11 columns (object, id, type, updated_at, uri, name, description, size, download_uri, content_type, content_encoding)

Third table: using the same method as previous, I pulled the keys and joined as a string to use in the "CREATE TABLE" statement. This worked for the first couple entries but then next JSON object had different keys which caused an error, "table has 64 columns but 60 values were supplied". I'm assuming this will probably happen with more entries, whether more or less columns/values.

Code follows:

import requests
import urllib
import json
import sqlite3
from os import path, replace

BulkDATAurl = "https://api.scryfall.com/bulk-data"      # Store URL path as (str)
BulkDataJSON = json.load(urllib.request.urlopen(BulkDATAurl))

#OPEN DATABASE CONNECTION AND SET CURSOR
connection = sqlite3.connect("JSONTestDB")
cursor = connection.cursor()

#CREATE MAIN BULKDATA TABLE - this works but open to other suggestions
columns = ", ".join('"' + str(x) + '"' for x in BulkDataJSON.keys())
values = ", ".join('"' + str(x) + '"' for x in BulkDataJSON.values())
CreateTableCmd = "CREATE TABLE IF NOT EXISTS TestTable (" + columns + ")"
InsertTableCmd = "INSERT INTO %s (%s) VALUES (%s);" % ('TestTable', columns, values) 
cursor.execute(CreateTableCmd)
cursor.execute(InsertTableCmd)
connection.commit()

#CREATE BULKDATA DATA TABLE - also works like previous
columns = ", ".join('"' + str(x) + '"' for x in  BulkDataJSON['data'][0].keys())
CreateTableCmd = "CREATE TABLE IF NOT EXISTS TestTableData (" + columns + ")"
cursor.execute(CreateTableCmd)
i=0
for item in BulkDataJSON['data']:    
    values = ", ".join('"' + str(x) + '"' for x in  BulkDataJSON['data'][i].values())
    i += 1
    InsertTableCmd = "INSERT INTO %s (%s) VALUES (%s);" % ('TestTableData', columns, values) 
    cursor.execute(InsertTableCmd)
connection.commit()

#CREATE TABLE FROM EXTRACTED URL - this is where my problem happens, 3rd item has less keys, some of which dont even match
oracleURL = BulkDataJSON['data'][2]['download_uri']
oracleJSON = json.load(urllib.request.urlopen(oracleURL))
columns = ", ".join('"' + str(x) + '"' for x in oracleJSON[00000].keys())
CreateTableCmd = "CREATE TABLE IF NOT EXISTS TestTableOracle (" + columns + ")"
cursor.execute(CreateTableCmd)

i=0
for item in oracleJSON[00000]:    
    values =  ", ".join('"' + str(x).replace('"', '-') + '"' for x in  oracleJSON[i].values())
    i += 1
    InsertTableCmd = "INSERT INTO %s (%s) VALUES (%s);" % ('TestTableOracle', columns, values) 
    cursor.execute(InsertTableCmd)
    connection.commit()

connection.commit()
connection.close()

From here I'm not entirely sure where to go, or if there's a more efficient way to carry out what I'm trying to do, any help would be greatly appreciate. Also apologies if there are posts that cover this, I may have missed it or misunderstood it.


Solution

  • Now that I got to an actual terminal, here's a refactoring of your code that runs nicely. I tried to pepper it with comments where necessary.

    The idea is, anyway, that once we've downloaded a file from the API, we iterate over it to get a full list of columns, create a table based on that, and then use the same column list to generate rows (which have Nones in them for columns that don't exist in that particular original row; and where data that can't be handled by SQLite by default is turned back into JSON for storage).

    import json
    import sqlite3
    
    import requests
    
    connection = sqlite3.connect("JSONTestDB.sqlite3")
    
    
    def jsonify_complex_value(val):
        if isinstance(val, (dict, list)):
            return json.dumps(val)
        return val
    
    
    def create_table_from_list(table_name, list_url):
        print(f"Downloading {table_name} from {list_url}")
        data_resp = requests.get(list_url)
        data_resp.raise_for_status()
        data = data_resp.json()
        assert isinstance(data, list)
        # Get all the columns
        all_columns = set()
        for row in data:
            all_columns.update(row.keys())
    
        # Some columns may be reserved words in SQLite, so we need to quote them
        # (but here's hoping there's nothing that needs more escaping)
        columns_for_sql = ",".join(f'"{col}"' for col in all_columns)
        qmarks_for_sql = ", ".join("?" * len(all_columns))
    
        print(f"{table_name} has {len(all_columns)} columns and {len(data)} rows")
    
        cursor = connection.cursor()
        # Do everything here in a transaction for speed
        cursor.execute("BEGIN TRANSACTION")
        # Create the table
        create_table_sql = (
            f"CREATE TABLE IF NOT EXISTS {table_name} ({columns_for_sql})"
        )
        cursor.execute(create_table_sql)
    
        # Create the insert statement
        insert_sql = f"INSERT INTO {table_name} ({columns_for_sql}) VALUES ({qmarks_for_sql})"
    
        # Define a generator that will replace missing columns with None;
        # also, convert complex values to JSON.
        row_gen = (
            tuple(jsonify_complex_value(row.get(col)) for col in all_columns)
            for row in data
        )
    
        # Insert the data by running `insert_sql` in bulk
        cursor.executemany(insert_sql, row_gen)
    
        # Commit the transaction, we're done
        connection.commit()
    
    
    def main():
        bulk_data_resp = requests.get("https://api.scryfall.com/bulk-data")
        bulk_data_resp.raise_for_status()
        for obj in bulk_data_resp.json()["data"]:
            table_name = obj["type"]
            url = obj["download_uri"]
            if table_name not in ("oracle_cards", "default_cards"):
                # Skip things we don't care about for this example
                print("Skipping", table_name)
                continue
            create_table_from_list(table_name, url)
    
    
    if __name__ == "__main__":
        main()
    

    To verify that the data is there, and may contain gorgons:

    $ sqlite3 JSONTestDB.sqlite3 'select name from oracle_cards order by random() limit 1;'
    Vraska the Unseen