Tags: python, sqlite, cluster-analysis, database-performance

Setting up SQLite database for cluster analysis


I'm completely new to databases. I'd like to get some advice on how to set up and use a SQLite database for cluster analysis and topic modeling tasks.

I have a 2 GB file where each line is a JSON object. Here is an example JSON object from the file:

{"body": "Heath Ledger's Joker...", "subreddit_id": "t5_2qh3s", "name": "t1_clpmhgo", "author": "l3thaln3ss", "created_utc": "1414799999", "subreddit": "movies", "parent_id": "t3_2kwdi3", "score": 1, "link_id": "t3_2kwdi3", "sub_type": "links - high"}

I have created a SQLite database like so:

import json
import sqlite3
import sys

def main(argv):
    if len(argv) != 2:
        sys.exit("Provide a database name.")

    dbName = argv[1]
    db = sqlite3.connect(dbName)

    db.execute('''CREATE TABLE IF NOT EXISTS Comments
              (name text primary key,
               author text,
               body text,
               score integer,
               parent_id text,
               link_id text,
               subreddit text,
               subreddit_id text,
               sub_type text,
               created_utc text,
               foreign key (parent_id) references Comment(name));''')

    db.commit()
    db.close()

if __name__ == "__main__":
    main(sys.argv)

Is this a good initial setup for the database?

I am populating the database like so:

import json
import sqlite3
import sys

def main(argv):
    if len(argv) != 2:
        sys.exit("Provide a comment file (of json objects) name.")

    fname = argv[1]

    db = sqlite3.connect("commentDB")
    columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id', 'subreddit', 'subreddit_id', 'sub_type', 'created_utc']

    query = "insert or ignore into Comments values (?,?,?,?,?,?,?,?,?,?)"

    c = db.cursor()

    with open(fname, 'r') as infile:
        for comment in infile:
            decodedComment = json.loads(comment)
            keys = ()
            for col in columns:
                keys += (decodedComment[col],)
            print(keys)
            print()
            c.execute(query, keys)

    c.close()
    db.commit()
    db.close()


if __name__ == "__main__":
    main(sys.argv)

Ultimately, I'm going to be clustering subreddits based on shared frequent words in comments, on which users comment where, and on differences between the topic models obtained by analyzing the words found in subreddit comments. Note that I have many more 2 GB files I'd like to work with, so ideally a solution should be relatively scalable. Any general advice on how to set up the database for this sort of work (especially by improving what I have written) would be greatly appreciated.

Thanks!

Edit: removed question about insert performance.


Solution

  • Several minor improvements suggest themselves -- e.g., the CREATE TABLE for Comments has a references Comment(name) where I'm pretty sure Comment is a misspelling and you meant Comments (so your code as posted wouldn't work).
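
    With that typo fixed, the foreign-key clause in your CREATE TABLE would read:

        foreign key (parent_id) references Comments(name)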

    Speed-wise, building the peculiarly-named keys as a tuple is somewhat wasteful -- a list would be much better, i.e., replace

            keys = ()
            for col in columns:
                keys += (decodedComment[col],)
    

    with

            keys = [decodedComment[col] for col in columns]
    

    for slightly better performance (it's not clearly documented, perhaps, but the cursor's execute method takes a second arg that's a list just as happily as it takes a tuple).
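
    If you want to go a bit further in the same direction (just a sketch of one common option, not something you strictly need), you can batch the decoded rows and hand each batch to executemany instead of calling execute once per line:

        import json
        import sqlite3

        def load_comments(db_path, fname, batch_size=10000):
            # Same columns and insert statement as in your loader.
            columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id',
                       'subreddit', 'subreddit_id', 'sub_type', 'created_utc']
            query = "insert or ignore into Comments values (?,?,?,?,?,?,?,?,?,?)"
            db = sqlite3.connect(db_path)
            batch = []
            with open(fname, 'r') as infile:
                for line in infile:
                    decodedComment = json.loads(line)
                    batch.append([decodedComment[col] for col in columns])
                    if len(batch) >= batch_size:
                        db.executemany(query, batch)  # one call per batch of rows
                        batch = []
            if batch:
                db.executemany(query, batch)  # flush the final partial batch
            db.commit()
            db.close()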

    But overall you have a good start -- and should be fine after ingesting a single 2 GB input file. However, SQLite, awesome as it is in many respects, doesn't really scale up well for multiples of that size -- you'll need a "real" database for that. I'd recommend PostgreSQL, but MySQL (and its variants such as MariaDB) and commercial (non-open-source) offerings would probably be just fine too.
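
    If/when you make that jump, very little of your code has to change. Purely as a hedged illustration (psycopg2 is only one possible driver, and the connection parameters below are stand-ins), the same insert might look like this against PostgreSQL, assuming a Comments table like yours already exists there:

        import psycopg2

        # Hypothetical connection parameters -- substitute your own.
        db = psycopg2.connect(dbname="comments", user="me", host="localhost")
        cur = db.cursor()

        # psycopg2 uses %s placeholders where sqlite3 uses ?, and ON CONFLICT
        # DO NOTHING plays the role of INSERT OR IGNORE; the loop that builds
        # keys from each decoded JSON line stays exactly the same.
        query = ("insert into Comments values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) "
                 "on conflict do nothing")
        keys = ["t1_clpmhgo", "l3thaln3ss", "Heath Ledger's Joker...", 1,
                "t3_2kwdi3", "t3_2kwdi3", "movies", "t5_2qh3s",
                "links - high", "1414799999"]  # the sample comment from the question
        cur.execute(query, keys)

        db.commit()
        cur.close()
        db.close()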

    If by "many more" (2GB files) you mean hundreds, or thousands, even "serious" professional DBs might at some point start creaking at the seams, depending on what processing, exactly, you plan to throw at them; the mention of "every word in the comment" (implying I imagine the body field needs to be processed -- stemming &c -- into a collection of words) is slightly worrisome in promising very heavy processing to come.

    Once that becomes a problem, "NoSQL" offerings, or systems seriously meant to scale up such as BigQuery, may be worth your while. However, for small-scale experimentation, you can surely start with SQLite and use it to develop the algorithms for the "clustering" you have in mind; then scale up to PostgreSQL or whatever to check how those algorithms behave on middle-scale work; only at that point, if need be, take on the extra work of considering non-relational solutions, which, while very powerful, tend to require commitment to certain patterns of access (relational DBs, where the most you likely need is to add indices, are more suitable for the experimental stage!).
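
    For example, if your clustering queries will mostly filter or group by subreddit, a single index on that column is likely all the tuning the SQLite stage needs -- a sketch (the index name is just illustrative):

        import sqlite3

        db = sqlite3.connect("commentDB")
        # Index the column the per-subreddit queries will group and filter on.
        db.execute("CREATE INDEX IF NOT EXISTS idx_comments_subreddit "
                   "ON Comments(subreddit)")
        db.commit()
        db.close()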