
Is it possible to read a gzip file directly with jq?


I'm reading huge JSON files with jq, using something like:

jq -r '[.a, .b, .time] | @tsv' file.txt

These files arrive as gz files, and I spend 20 minutes each day just running gunzip on them. Is it possible for jq to read the files directly from the gz format? If so, would that be faster overall, or would it slow down my process?


Solution

  • If it takes 20 minutes to decompress the file, it's going to take 20 minutes whether the decompression library is used by gunzip or by jq.

    However, you can avoid writing the decompressed file to disk, and the time that takes, by using either of the following:

    gunzip -c file.gz | jq -r '[ .a, .b, .time ] | @tsv' >file.tsv
    
    gunzip <file.gz | jq -r '[ .a, .b, .time ] | @tsv' >file.tsv
    

    To be clear, the above uses minimal memory, given that the input is a series of small JSON documents of the form {"a": "a", "b": "a", "time": "20210210T10:10:00"}. None of the three files (compressed, decompressed, or TSV) is ever held in memory in its entirety.

    The following demonstrates the streaming nature of jq:

    $ (
       j='{"a": "a", "b": "a", "time": "20210210T10:10:00"}'
       printf '%s\n' "$j"
       printf '%s\n' "$j"
       sleep 4
       printf '%s\n' "$j"
    ) | jq -r '[ .a, .b, .time, now ] | @tsv'
    a       a       20210210T10:10:00       1620305187.460741
    a       a       20210210T10:10:00       1620305187.460791
    [4 second pause]
    a       a       20210210T10:10:00       1620305191.459734
    

    The first two records are emitted without delay, and the third is emitted after 4 seconds. This is reflected by the timestamps.
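
To check that piping changes only where the decompressed bytes go, and not the result, the two approaches can be run side by side on a small sample. The following is a minimal sketch, assuming jq, gzip and standard POSIX utilities are installed; the working directory, the file names and the 1000 sample records are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare "decompress to disk, then jq" with "pipe straight into jq".
# All file names and the sample records are hypothetical.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Build a small gzip-compressed input: one JSON document per line.
record='{"a": "a", "b": "a", "time": "20210210T10:10:00"}'
i=0
while [ "$i" -lt 1000 ]; do
    printf '%s\n' "$record"
    i=$((i + 1))
done | gzip >"$workdir/file.gz"

# Approach 1: write the decompressed file to disk, then read it with jq.
gunzip -c "$workdir/file.gz" >"$workdir/file.txt"
jq -r '[ .a, .b, .time ] | @tsv' "$workdir/file.txt" >"$workdir/disk.tsv"

# Approach 2: stream the decompressed data into jq; no decompressed copy is written.
gunzip -c "$workdir/file.gz" | jq -r '[ .a, .b, .time ] | @tsv' >"$workdir/pipe.tsv"

# Both approaches should produce byte-identical TSV output.
if cmp -s "$workdir/disk.tsv" "$workdir/pipe.tsv"; then
    result=identical
else
    result=different
fi
lines=$(wc -l <"$workdir/pipe.tsv" | tr -d ' ')
echo "$result $lines"
```

On real data, each approach could be prefixed with `time` to see how much of the 20 minutes comes from writing and re-reading the decompressed file. The difference will depend on the disk and the file size, so it is worth measuring rather than assuming.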