
Select lots of known IDs from a big JSON document efficiently


I am trying to extract values from JSON via jq in bash. With a small document it works fine, but with a big JSON it is very slow, roughly one value every 2-3 seconds. Example of my code:

# Download the JSON document once
pagejson=$(curl -s -A "some useragent" "url")
# Read the list of ids to look up
pid=$(jq '.page_ids[]' idlist.json)
for id in $pid
do
    echo "$pagejson" | jq -r '.page[]|select(.id=='"$id"')|.url' >> path.url
done

The "pid" variable holds the list of ids, prepared before running the script; it may contain 700-1000 ids. Example of the JSON document:

{
  "page": [
    {
      "url": "some url",
      "id": some numbers
    },
    {
      "url": "some url",
      "id": some numbers
    }
  ]
}

Is there any way to speed it up? The equivalent in JavaScript runs much faster. Example of the JavaScript:

// First select the matching objects, in id order
var url = "";
var sortedjson = ids.map(id => obj.find(page => page.id === id));
// Then concatenate the urls
for (var x = 0; x < sortedjson.length; x++) {
    url += sortedjson[x].url;
}

Should I sort the JSON as in the JavaScript version for better performance? I haven't tried it because I don't know how.

Edit: Replaced the "pid" variable with the JSON to use less code, and `for id in $(echo $pid)` with `for id in $pid`. But it still slows down when the id list has more than about 50 entries.


Solution

  • Calling jq once per id is always going to be slow. Don't do that: call jq just once, and have it match against the full set.

    You can accomplish that by passing the entire comma-separated list of ids to a single jq invocation, and letting jq itself do the work of splitting that string into individual items (and then putting them in a dictionary for fast constant-time lookups).

    For example:

    pid="24885,73648,38758,8377,747"
    jq --arg pidListStr "$pid" '
      ($pidListStr | [split(",")[] | {(.): true}] | add) as $pidDict |
      .page[] | select($pidDict[.id | tostring]) | .url
    ' <<<"$pagejson"
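
    Putting it together with the question's setup, the whole loop collapses into one jq call. This is a sketch: the inline `pagejson` string and the generated `idlist.json` stand in for the curl download and the real id file from the question, and `map(tostring) | join(",")` is one way to turn the `.page_ids` array into the comma-separated string the filter expects.

    ```shell
    # Stand-in for the curl download in the question
    pagejson='{"page":[{"url":"url one","id":1},{"url":"url two","id":2},{"url":"url three","id":3}]}'
    # Stand-in for the question's idlist.json
    printf '{"page_ids":[1,3]}' > idlist.json

    # Turn the .page_ids array into one comma-separated string
    pid=$(jq -r '.page_ids | map(tostring) | join(",")' idlist.json)

    # Single jq call: build a lookup dictionary from the id list,
    # then keep only the pages whose id appears in it
    jq -r --arg pidListStr "$pid" '
      ($pidListStr | [split(",")[] | {(.): true}] | add) as $pidDict |
      .page[] | select($pidDict[.id | tostring]) | .url
    ' <<<"$pagejson" > path.url
    ```

    Note the `-r` flag, which writes the urls as raw strings (no surrounding quotes), matching what the original loop appended to `path.url`.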