I'm reading huge JSON files with jq, something like:
jq -r '[.a, .b, .time] | @tsv' file.txt
Those files come in as gz files, and I spend 20 minutes each day just gunzipping them.
I was wondering: is it possible for jq to read the files directly in gz format? And if so, will it be faster overall, or will it slow down my process?
If it takes 20 minutes to unzip, it's going to take 20 minutes to unzip whether the decompression library is used by gunzip or by jq.
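If you want to confirm that decompression itself is the bottleneck, you can time it in isolation against the full pipeline (file.gz below is a stand-in for one of your actual files):

# Time decompression alone, discarding the output:
time gunzip -c file.gz >/dev/null

# Time the full pipeline for comparison:
time gunzip -c file.gz | jq -r '[ .a, .b, .time ] | @tsv' >/dev/null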
But you could avoid writing the unzipped file to disk, and the time spent doing so, by piping the decompressed stream straight into jq. This can be achieved with either of the following:
gunzip -c file.gz | jq -r '[ .a, .b, .time ] | @tsv' >file.tsv
gunzip <file.gz | jq -r '[ .a, .b, .time ] | @tsv' >file.tsv
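(zcat file.gz is a synonym for gunzip -c file.gz on most Linux systems; on some BSD/macOS systems the gzip-aware variant is gzcat.) If the resulting TSV is itself large, you could also compress it on the fly, again without any intermediate file on disk, for example:

gunzip -c file.gz | jq -r '[ .a, .b, .time ] | @tsv' | gzip >file.tsv.gz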
To be clear, the above uses minimal memory, given that the input is a series of small JSON documents of the form {"a": "a", "b": "a", "time": "20210210T10:10:00"}. None of the three files (compressed, decompressed, or TSV) is ever held in memory in its entirety; jq processes the input one JSON document at a time.
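If you want to check the memory claim on your own data, one way (assuming GNU time is installed as /usr/bin/time, which is distinct from the shell's time keyword; the BSD/macOS equivalent flag is -l rather than -v) is:

gunzip -c file.gz | /usr/bin/time -v jq -r '[ .a, .b, .time ] | @tsv' >file.tsv

The "Maximum resident set size" it reports for jq should stay small and roughly constant no matter how large file.gz is.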
The following demonstrates the streaming nature of jq:
$ (
    j='{"a": "a", "b": "a", "time": "20210210T10:10:00"}'
    printf '%s\n' "$j"
    printf '%s\n' "$j"
    sleep 4
    printf '%s\n' "$j"
) | jq -r '[ .a, .b, .time, now ] | @tsv'
a a 20210210T10:10:00 1620305187.460741
a a 20210210T10:10:00 1620305187.460791
[4 second pause]
a a 20210210T10:10:00 1620305191.459734
The first two records are emitted without delay, and the third is emitted after 4 seconds, as the timestamps produced by now reflect.
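Finally, since this is a daily job over multiple files, a possible loop (assuming the inputs are named *.gz and each should produce a matching .tsv) might look like:

# Convert each compressed JSON file to TSV without ever writing
# the decompressed JSON to disk; "${f%.gz}" strips the .gz suffix.
for f in *.gz; do
    gunzip -c "$f" | jq -r '[ .a, .b, .time ] | @tsv' >"${f%.gz}.tsv"
done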