Process a huge GeoJSON file with jq


Given a GeoJSON file as follows:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....

I want to end up with the following:

{
  "type": "FeatureCollection",
  "features": [
    {
      "tippecanoe": {"minzoom": 13},
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....

That is, I have added the tippecanoe object to each feature in the features array.

I can make this work with:

 jq '.features[].tippecanoe.minzoom = 13' <GEOJSON FILE> > <OUTPUT FILE>

This is fine for small files, but processing a large file of 414 MB seems to take forever, with the processor maxing out and nothing being written to the output file.

Reading further into jq, it appears that the --stream command-line option may help, but I am completely confused as to how to use it for my purposes.

I would be grateful for an example command line that does this, along with an explanation of what --stream is doing.


Solution

  • A one-pass jq-only approach may require more RAM than is available. If so, two alternatives are shown below: a simple pipeline that uses only jq, and a more economical one that combines jq with awk.

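    The two approaches are the same except for the final step, which reconstitutes the stream of objects into a single JSON document. With jq -s (slurp), every object has to be held in memory at once, whereas awk can write the wrapper and the objects line by line, so this step can be accomplished very economically using awk.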

    In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.

    jq-only

    jq -c  '.features[]' input.json |
        jq -c '.tippecanoe.minzoom = 13' |
        jq -c -s '{type: "FeatureCollection", features: .}'
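
A side note on key order: the filter .tippecanoe.minzoom = 13 appends the new key after the existing ones, whereas the desired output in the question lists it first. Key order carries no meaning in JSON, so this should not matter to a conforming consumer. If the order is wanted anyway, one option is to add the feature onto a new object. The following is a minimal sketch using a made-up one-feature sample, not part of the pipelines that were timed:

```shell
# Hypothetical single feature, standing in for one line of `jq -c '.features[]'` output.
feature='{"type":"Feature","properties":{"FEATCODE":15014}}'

# Object addition keeps the left-hand keys first, so tippecanoe leads.
# Caveat: if the feature already had a tippecanoe key, the feature's value would win.
first=$(printf '%s' "$feature" | jq -c '{tippecanoe: {minzoom: 13}} + .')
printf '%s\n' "$first"
```

In the pipelines here, this filter would take the place of '.tippecanoe.minzoom = 13' in the second stage.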
    

    jq and awk

    jq -c '.features[]' input.json |
       jq -c '.tippecanoe.minzoom = 13' | awk '
         BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
         NR==1 { print; next }
               {print ","; print}
         END   {print "] }";}'
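
As for the --stream option the question asks about: the first stage of both pipelines, jq -c '.features[]' input.json, still has to parse the whole document before it emits anything. With --stream, jq instead emits small events as it reads: [path, leaf] for each scalar, and [path] when an array or object closes. The sketch below first prints those events for a made-up two-feature sample, then rebuilds each feature from its events with fromstream. Stripping the "features", <index> prefix by hand is a choice made here for clarity (the builtin truncate_stream serves a similar purpose), and this variant was not part of the timings reported below.

```shell
# Hypothetical stand-in for input.json so the sketch runs on its own.
sample='{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"FEATCODE":15014}},{"type":"Feature","properties":{"FEATCODE":15015}}]}'

# 1. Show what --stream emits: one event per line.
events=$(printf '%s' "$sample" | jq -c --stream .)
printf '%s\n' "$events"

# 2. Rebuild the features one at a time and add the tippecanoe object.
#    -n plus `inputs` lets the filter consume the events incrementally.
features=$(printf '%s' "$sample" | jq -cn --stream '
  fromstream(
    inputs
    | select(.[0][0] == "features" and (.[0] | length) > 2)  # keep events inside a feature
    | .[0] |= .[2:]                                          # drop the "features", <index> prefix
  )
  | .tippecanoe.minzoom = 13
')
printf '%s\n' "$features"
```

Against the real file, the second command would read input.json instead of the sample and could replace the first two stages of either pipeline, with the awk or jq -s step left unchanged.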
    

    Performance comparison

    For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1 GB.

    CPU time, user + sys (u+s):

    jq-only:              15m 15s
    jq-awk:                7m 40s
    jq one-pass using map: 6m 53s
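
The command behind the "jq one-pass using map" row is not shown above. One plausible form, offered as an assumption rather than as the command that was actually timed, is sketched here on a made-up one-feature sample:

```shell
# Hypothetical stand-in for input.json.
sample='{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"FEATCODE":15014}}]}'

# Update every element of .features in a single jq invocation.
# The whole document is parsed into memory, which is what makes this RAM-hungry.
onepass=$(printf '%s' "$sample" | jq -c '.features |= map(.tippecanoe.minzoom = 13)')
printf '%s\n' "$onepass"
```

This is the fastest of the three when the document fits in memory, which is consistent with the opening remark that a one-pass approach may need more RAM than is available.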