Getting information from SQL table with Python

I have a 20gb database file (*.db) that contains a lot of information on scientific articles, acquired from Scopus. Most information is stored in the responses table within the database file. I only want to obtain the subject areas from each published article. I have no idea how to get this from the database file.

In each row in the database file a lot of information is stored on the author, the article, and also the subject area. A snippet looks like this:

{"abstracts-retrieval-response":{"coredata":{"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/85012897283","dc:identifier":"SCOPUS_ID:85012897283","eid":"2-s2.0-85012897283","language":{"@xml:lang": "eng"},"authkeywords":{"author-keyword": [{"@_fa": "true", "$" :"austerity policies"},{"@_fa": "true", "$" :"housing policy"},{"@_fa": "true", "$" :"Italy"},{"@_fa": "true", "$" :"Mediterranean welfare regime"},{"@_fa": "true", "$" :"Neoliberalism"},{"@_fa": "true", "$" :"Spain"}]},"idxterms":null,"subject-areas":{"subject-area": [{"@_fa": "true", "@abbrev": "SOCI", "@code": "3303", "$" :"Development"},{"@_fa": "true", "@abbrev": "SOCI", "@code": "3322", "$" :"Urban Studies"},{"@_fa": "true", "@abbrev": "ENVI", "@code": "2308", "$" :"Management, Monitoring, Policy and Law"}]}}

From this large (but in reality even much larger) table, I am only interested in getting the dc-identifier and the (multiple) subject-areas. Ideally I would get them in a *.csv file.

Is there a straightforward way to obtain this information from the *.db file using Python? Using sqlite3 I seem to be able to get acces to the database using the following code:

import sqlite3
conn = sqlite3.connect('C:/responses.db')       
c = conn.cursor()

To me it's not clear how I can now only get the dc-identifier and subject-areas information from the database file.

Solution

Using the json.loads() function will convert the input JSON string to a Python object consisting of a hierarchical set of dictionaries and lists. You then need to extract the information from that using standard Python operators. Following is an example, though I'm not sure that the example pulls the correct document ID. This is also untested, so you may need to tweak or correct it. Also, your example input has unbalanced curly braces, so the interpretation of the structure used below may not be quite right.

import sqlite3
import json

conn = sqlite3.connect('C:/responses.db')       
c = conn.cursor()

# Initialize the output list.
subjectlist = []

# Get the data from SQLite.
c.execute("select * from responses;")

# Iterate over all the rows of data
for row in c:
    # Extract and save the subject information.
    article = json.loads(row[0])
    doc_id = article["abstracts-retrieval-response"]["coredata"]["dc:identifier"]
    subjects = [s["$"] for s in article["abstracts-retrieval-response"]["coredata"]["subject-areas"]["subject-area"]]
    for s in subjects:
        subjectlist.append([doc_id, s])

At the conclusion of this code, subjectlist will be a list of two-element lists, each of the latter consisting of the document ID and a subject area. You can then use the csv library to export this, or maybe push it back into the database as a new table to make further querying easier.