I'm completely new to databases. I'd like to get some advice on how to set up and use a SQLite database for cluster analysis and topic modeling tasks.
I have a 2 GB file where each line is a JSON object. Here is an example JSON object from the file:
{"body": "Heath Ledger's Joker...", "subreddit_id": "t5_2qh3s", "name": "t1_clpmhgo", "author": "l3thaln3ss", "created_utc": "1414799999", "subreddit": "movies", "parent_id": "t3_2kwdi3", "score": 1, "link_id": "t3_2kwdi3", "sub_type": "links - high"}
I have created a SQLite database like so:
import json
import sqlite3
import sys

def main(argv):
    if len(argv) != 2:
        sys.exit("Provide a database name.")
    dbName = argv[1]
    db = sqlite3.connect(dbName)
    db.execute('''CREATE TABLE IF NOT EXISTS Comments
                  (name text primary key,
                   author text,
                   body text,
                   score integer,
                   parent_id text,
                   link_id text,
                   subreddit text,
                   subreddit_id text,
                   sub_type text,
                   created_utc text,
                   foreign key (parent_id) references Comment(name));''')
    db.commit()
    db.close()

if __name__ == "__main__":
    main(sys.argv)
Is this a good initial setup for the database?
I am populating the database like so:
import json
import sqlite3
import sys

def main(argv):
    if len(argv) != 2:
        sys.exit("Provide a comment file (of json objects) name.")
    fname = argv[1]
    db = sqlite3.connect("commentDB")
    columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id',
               'subreddit', 'subreddit_id', 'sub_type', 'created_utc']
    query = "insert or ignore into Comments values (?,?,?,?,?,?,?,?,?,?)"
    c = db.cursor()
    with open(fname, 'r') as infile:
        for comment in infile:
            decodedComment = json.loads(comment)
            keys = ()
            for col in columns:
                keys += (decodedComment[col],)
            print str(keys)
            print
            c.execute(query, keys)
    c.close()
    db.commit()
    db.close()

if __name__ == "__main__":
    main(sys.argv)
Ultimately, I'm going to be clustering subreddits based on shared frequent words in comments, on which users comment where, and on differences in topic models obtained by analyzing the words found in subreddit comments. Note that I have many more 2 GB files I'd like to work with, so ideally a solution should be relatively scalable. Any general advice on how to set up the database for this sort of work (especially by improving what I have written) would be greatly appreciated.
Thanks!
Edit: removed question about insert performance.
Several minor improvements suggest themselves -- e.g., the CREATE TABLE for `Comments` has a `references Comment(name)` where I'm pretty sure `Comment` is a mis-spelling and you meant `Comments` (so your code as posted wouldn't work).
Speed-wise, building the peculiarly-named `keys` as a tuple is somewhat wasteful -- a list would be much better, i.e., replace

    keys = ()
    for col in columns:
        keys += (decodedComment[col],)

with

    keys = [decodedComment[col] for col in columns]

for slightly better performance (it's not clearly documented, perhaps, but the cursor's `execute` method takes a second arg that's a list just as happily as it takes a tuple).
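To make that concrete, here is a minimal end-to-end sketch of the list-based approach, using a simplified in-memory table and the sample JSON object from the question as a stand-in for one line of the input file:

```python
import json
import sqlite3

# simplified in-memory setup for illustration; the real code connects to a file
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Comments
              (name text primary key, author text, body text, score integer,
               parent_id text, link_id text, subreddit text, subreddit_id text,
               sub_type text, created_utc text)""")

columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id',
           'subreddit', 'subreddit_id', 'sub_type', 'created_utc']
query = "insert or ignore into Comments values (?,?,?,?,?,?,?,?,?,?)"

# one sample line, standing in for a line read from the input file
line = ('{"body": "some text", "subreddit_id": "t5_2qh3s", "name": "t1_clpmhgo",'
        ' "author": "someone", "created_utc": "1414799999", "subreddit": "movies",'
        ' "parent_id": "t3_2kwdi3", "score": 1, "link_id": "t3_2kwdi3",'
        ' "sub_type": "links - high"}')

decodedComment = json.loads(line)
keys = [decodedComment[col] for col in columns]  # list comprehension, no tuple rebuilding
db.execute(query, keys)  # a list of parameters is accepted just as happily as a tuple
db.commit()
```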
But overall you have a good start -- and should be fine after ingesting a single 2GB input file. However, sqlite, awesome as it is in many respects, doesn't really scale up well for multiples of that size -- you'll need a "real" database for that. I'd recommend PostgreSQL, but probably MySQL (and its variants such as MariaDB) and commercial (non-open-source) offerings would be just fine too.
If by "many more" (2GB files) you mean hundreds, or thousands, even "serious" professional DBs might at some point start creaking at the seams, depending on what processing, exactly, you plan to throw at them; the mention of "every word in the comment" (implying, I imagine, that the `body` field needs to be processed -- stemming &c -- into a collection of words) is slightly worrisome in promising very heavy processing to come.
Once that becomes a problem, "NoSQL" offerings, or stuff seriously meant to scale up such as e.g. BigQuery, may be worth your while. However, for small-scale experimentation, you can surely start with sqlite and use it to develop the algorithms for the "clustering" you have in mind; then scale up to PostgreSQL or whatever to check how those perform on middle-scale work; only at that point, if need be, take the extra work to consider non-relational solutions, which, while very powerful, tend to require commitment to certain patterns of access (relational DBs, where the most you likely need is to add indices, are likely more suitable for experimental-stage play!).
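On that last point, adding indices on the columns your clustering queries will filter or group by (per-subreddit and per-author scans, say) is typically all the tuning sqlite needs at the experimental stage. A minimal sketch -- simplified schema, and the index names are just illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory DB for illustration
# simplified version of the Comments table
db.execute("""CREATE TABLE IF NOT EXISTS Comments
              (name text primary key, author text, body text, subreddit text)""")
# indices on the columns the clustering queries will filter and group by
db.execute("CREATE INDEX IF NOT EXISTS idx_comments_subreddit ON Comments(subreddit)")
db.execute("CREATE INDEX IF NOT EXISTS idx_comments_author ON Comments(author)")
db.commit()
```

Create the indices after bulk ingestion rather than before -- maintaining them row-by-row during inserts is slower than building them once at the end.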