Tags: python, database, bigdata, google-bigquery, data-analysis

Large Data Analytics


I'm trying to analyze a large amount of GitHub Archive data and am stumped by many limitations.

My analysis requires me to search a 350 GB dataset. I have a local copy of the data, and a copy is also available via Google BigQuery. The local dataset is split into 25,000 individual files. The dataset is a timeline of events.

I want to plot the number of stars each repository has had since its creation (only for repositories that currently have more than 1,000 stars).

I can get this result very quickly using Google BigQuery, but each query "analyzes" 13.6 GB of data. That limits me to fewer than 75 queries before I have to pay $5 per additional 75 (the free monthly quota covers 1 TB of processed data, and 1 TB / 13.6 GB ≈ 75 queries).

My other option is to search through my local copy, but scanning each file for a specific string (the repository name) takes far too long. It took over an hour on an SSD to get through half the files before I killed the process.

What is a better way I can approach analyzing such a large amount of data?

Python Code for Searching Through All Local Files:

import json

# counter, startAt, stopAt, json_entry and ConcatJSONDecoder are defined earlier in the script
for yy in range(11, 15):
    for mm in range(1, 13):
        for dd in range(1, 32):
            for hh in range(0, 24):
                counter = counter + 1
                if counter < startAt or counter > stopAt:
                    continue
                # archive files are named 20YY-MM-DD-H.json (month/day zero-padded, hour not);
                # nonexistent dates such as Feb 30 simply fail to open and are skipped
                path = "/Volumes/WD_1TB/GitHub Archive/20%02d-%02d-%02d-%d.json" % (yy, mm, dd, hh)
                try:
                    with open(path, 'r') as fp:
                        events = json.load(fp, cls=ConcatJSONDecoder)
                    for each_event in events:
                        if each_event["type"] == "WatchEvent":
                            try:
                                num_stars = int(each_event["repository"]["watchers"])
                                created_at = each_event["created_at"]
                                json_entry[4][created_at] = num_stars
                            except Exception as e:
                                print(e)
                except Exception as e:
                    print(e)
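
To turn the collected mapping of created_at timestamps to watcher counts into the plot described above, a matplotlib sketch along these lines could work. It assumes the json_entry[4] dictionary built in the loop and that the timestamps sort chronologically as plain strings; it is an illustration, not part of the original script.

import matplotlib.pyplot as plt

# json_entry[4] maps created_at strings to the watcher count at that moment
points = sorted(json_entry[4].items())   # assumes string sort order is chronological
counts = [stars for _, stars in points]

plt.plot(range(len(counts)), counts)
plt.xlabel("WatchEvent (chronological order)")
plt.ylabel("Watchers at event time")
plt.title("Repository stars over time")
plt.show()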

Google BigQuery SQL Command:

SELECT repository_owner, repository_name, repository_watchers, created_at
  FROM [githubarchive:github.timeline]
  WHERE type = "WatchEvent"
  AND repository_owner = "mojombo"
  AND repository_name = "grit"
  ORDER BY created_at
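
For what it's worth, the same legacy-SQL query can also be issued from Python with the google-cloud-bigquery client library. The sketch below assumes a configured Google Cloud project and credentials (none of this is from the original question), and each run still bills the same 13.6 GB of scanned data.

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials are configured

query = """
SELECT repository_owner, repository_name, repository_watchers, created_at
  FROM [githubarchive:github.timeline]
  WHERE type = "WatchEvent"
  AND repository_owner = "mojombo"
  AND repository_name = "grit"
  ORDER BY created_at
"""

# the bracketed table reference is legacy SQL, so legacy mode has to be enabled
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config):
    print(row["created_at"], row["repository_watchers"])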

I am really stumped, so any advice at this point would be greatly appreciated.


Solution

  • I found a solution to this problem: use a database. I imported the relevant data from my 360+ GB of JSON data into a MySQL database and queried that instead. What used to be a query time of over 3 hours per element became less than 10 seconds. A rough sketch of this setup follows below.

    MySQL wasn't the easiest thing to set up, and the import took approximately 7.5 hours, but the results made it well worth it for me.
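
    A sketch of what that setup could look like with mysql-connector-python follows; the answer didn't share the actual schema or import script, so the table layout, column names, and connection details are illustrative assumptions.

    import mysql.connector  # assumed driver; the answer does not name one

    # placeholder credentials and database name
    db = mysql.connector.connect(user="root", password="...", database="githubarchive")
    cur = db.cursor()

    # illustrative schema: one row per WatchEvent, indexed by repository
    cur.execute("""
        CREATE TABLE IF NOT EXISTS watch_events (
            repository_owner VARCHAR(255),
            repository_name  VARCHAR(255),
            watchers         INT,
            created_at       VARCHAR(40),
            INDEX repo_idx (repository_owner, repository_name)
        )
    """)

    # during the one-off import, only the relevant fields of each WatchEvent are inserted
    def insert_event(event):
        cur.execute(
            "INSERT INTO watch_events VALUES (%s, %s, %s, %s)",
            (event["repository"]["owner"], event["repository"]["name"],
             int(event["repository"]["watchers"]), event["created_at"]))
        # call db.commit() periodically during the bulk load

    # afterwards, each per-repository query hits the index instead of scanning the JSON files
    cur.execute(
        "SELECT created_at, watchers FROM watch_events "
        "WHERE repository_owner = %s AND repository_name = %s ORDER BY created_at",
        ("mojombo", "grit"))
    rows = cur.fetchall()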