My script processes ~30 lines per second and uses just one CPU core.
while read -r line; do echo "$line" | jq -c '{some-transformation-logic}'; done < input.json >> output.json
The input.json is a ~6 GB file with ~17M lines. It's newline-delimited JSON, not an array.
I have 16 cores (vCPUs on GCP), or more if that makes sense, and want to run this process in parallel. I know Hadoop is the way to go, but this is a one-time thing. How do I speed the process up to ~600 lines per second simply?
Line ordering is not important.
Given that the input order does not have to match the output order, try parallel:

parallel -j16 --spreadstdin '(transformation)' < input.json > output.json
Note that parallel runs one job per available CPU core by default (the same as -j 100%), so you can drop the -j16 and the command will adapt to the actual machine configuration. Check the man page for options/syntax.

parallel --spreadstdin '(transformation)' < input.json > output.json
This solution also "batches" multiple input lines per jq invocation, which removes the overhead of starting a new jq process for every line as implemented in the original post (see also the comments from @stkvtflw).
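As a concrete sketch (the jq filter placeholder is from the question, and 10M is just an illustrative value), you can control the batch size explicitly with --block:

parallel --spreadstdin --block 10M "jq -c '{some-transformation-logic}'" < input.json > output.json

parallel splits the input at newline boundaries near the block size, so each jq process receives a chunk of complete JSON lines.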
The --keep-order option can be used to preserve the input order, at some extra processing cost. Per the OP, that is not needed here.
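If ordering ever does matter, a minimal sketch of the same command with ordering preserved (-k is short for --keep-order):

parallel -k --spreadstdin '(transformation)' < input.json > output.json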