Search code examples
jsonregexvbaextractlarge-data

Reg Expression to reduce huge file size?


I have a series of gigantic (40-80mb) exported Google Location History JSON files, with which I've been tasked to analyze select activity data. Unfortunately Google has no parameter or option at their download site to choose anything except "one giant JSON containing forever". (The KML option is twice as big.)

Obvious choices like JSON-Converter (laexcel-test incarnation of VBA-JSON); parsing line-by line with VBA; even Notepad++. They all crash and burn. I'm thinking RegEx might be the answer.

  1. This Python script can extract the timestamp and location from a 40mb file in two seconds (with RegEx?). How is Python doing it so fast? (Would it be as fast in VBA?)

  2. I'd be able to extract everything I need, piece by piece, if only I had a magic chunk of RegEx, perhaps with this logic:

    • Delete everything except:
      When timestampMs and WALKING appear between the *same set of [square brackets] :

      • I need the 13-digit number that follows timestampMS, and,
      • the one- to three- digit number that follows WALKING.

If it's simpler to include a little more data, like "all the timestamps", or "all activities", I could easily sift through it later. The goal is to make the file small enough that I can manipulate it without the need to rent a supercomputer, lol.

I tried adapting existing RegEx's but I have a serious issue with both RegEx and musical instruments: doesn't how hard I try, I just can't wrap my head around it. So, this is indeed a "please write code for me" question, but it's just one expression, and I'll pay it forward by writing code for others today! Thanks... ☺ .

  }, {
    "timestampMs" : "1515564666086",    ◁― (Don't need this but it won't hurt)
    "latitudeE7" : -6857630899, 
    "longitudeE7" : -1779694452999,
    "activity" : [ {
      "timestampMs" : "1515564665992",  ◁― EXAMPLE: I want only this, and...
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 65
      }, {                                              ↓
        "type" : "TILTING",
        "confidence" : 4
      }, {
        "type" : "IN_RAIL_VEHICLE",
        "confidence" : 20                               ↓
      }, {
        "type" : "IN_ROAD_VEHICLE",
        "confidence" : 5
      }, {
        "type" : "ON_FOOT",                             ↓
        "confidence" : 3
      }, {
        "type" : "UNKNOWN",
        "confidence" : 3
      }, {
        "type" : "WALKING",             ◁―┬━━ ...AND, I also want this.
        "confidence" : 3                ◁―┘
      } ]
    } ]
  }, {
    "timestampMs" : "1515564662594",    ◁― (Don't need this but it won't hurt)
    "latitudeE7" : -6857630899, 
    "longitudeE7" : -1779694452999,
    "altitude" : 42
  }, {

Edit:

For testing purposes I made a sample file, representative of the original (except for the size). The raw JSON can be loaded directly from this Pastebin link, or downloaded as a local copy with this TinyUpload link, or copied as "one long line" below:

{"locations" : [ {"timestampMs" : "1515565441334","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 2299}, {"timestampMs" : "1515565288606","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42,"activity" : [ {"timestampMs" : "1515565288515","activity" : [ {"type" : "STILL","confidence" : 98}, {"type" : "ON_FOOT","confidence" : 1}, {"type" : "UNKNOWN","confidence" : 1}, {"type" : "WALKING","confidence" : 1} ]} ]}, {"timestampMs" : "1515565285131","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42}, {"timestampMs" : "1513511490011","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511369962","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511179720","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513511059677","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510928842","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510942911","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]}, {"timestampMs" : "1513510913776","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 15,"altitude" : -11,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513507320258","activity" : [ {"type" : "TILTING","confidence" : 100} ]} ]}, {"timestampMs" : "1513510898735","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510874140","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 19,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510874245","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]} ]}

The file tested as valid with JSONLint and FreeFormatter.


Solution

  • Obvious choices ...

    The obvious choice here is a JSON-aware tool that can handle large files quickly. In the following, I'll use jq, which can easily handle gigabyte-size files quickly so long as there is sufficient RAM to hold the file in memory, and which can also handle very large files even if there is insufficient RAM to hold the JSON in memory.

    First, let's assume that the file consists of an array of JSON objects of the form shown, and that the goal is to extract the two values for each admissible sub-object.

    This is a jq program that would do the job:

    .[].activity[]
    | .timestampMs as $ts
    | .activity[]
    | select(.type == "WALKING")
    | [$ts, .confidence]
    

    For the given input, this would produce:

    ["1515564665992",3]
    

    More specifically, assuming the above program is in a file named program.jq and that the input file is input.json, a suitable invocation of jq would be as follows:

    jq -cf program.jq input.json
    

    It should be easy to modify the jq program given above to handle other cases, e.g. if the JSON schema is more complex than has been assumed above. For example, if there is some irregularity in the schema, try sprinkling in some postfix ?s, e.g.:

    .[].activity[]?
    | .timestampMs as $ts
    | .activity[]?
    | select(.type? == "WALKING")
    | [$ts, .confidence]