Optimising my script for GNU Parallel


I have a script which successfully queries an API, but it is very slow: it will take around 16 hours to fetch all the resources. I looked at how I could optimise it, and I thought that GNU parallel (installed on macOS via Homebrew, version 20180522) would do the trick. But even with 90 jobs (the API endpoint allows at most 100 connections), my script is no faster. I'm not sure why.

I call my script like so:

bash script.sh | parallel -j90

The script is the following:

#!/bin/bash

# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment


# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file 
main(){
local file="${1}"
local line
local sign
local auteur_clean
local cosign_clean

while IFS= read -r line
    do
        sign="${line}/signataires"
        auteur_clean=$(auteur "${sign}")
        cosign_clean=$(cosignataires "${sign}")
        echo "${auteur_clean},${cosign_clean}" >> signataires_15.csv
done < "${file}"
}

# The auteur function takes the $sign variable as an input and
# 1. filters the json returned by the API to get only the author's ID
# 2. uses the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
# 3. echoes a cleaned version of the info stored in $auteur_nom
auteur(){
local url="${1}"
local auteur
local auteur_nom

auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
&& auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
| jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
&& echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}

# The cosignataires function takes the $sign variable as an input and
# 1. filters the json returned by the API to produce a space-separated list of co-author IDs
# 2. iterates over the list of co-authors to get their first name and surname, and assigns the resulting list to $cosign_nom
# 3. echoes a semicolon-separated list of the co-author names
cosignataires(){
local url="${1}"
local cosign
local cosign_nom
local i

cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
&& cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
&& echo "${cosign_nom}" | tr '\n' ';' | sed 's/;$//'
}

main "url_amendements_15.txt"

and the content of url_amendements_15.txt looks like so:

https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161

Solution

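  • Your script loops over the list of URLs and queries them one at a time. Piping its output to parallel does not change that: parallel only sees whatever the script prints, and all the work still happens sequentially inside the loop. You need to break it up so that each invocation handles a single URL; parallel then has independent commands it can run in parallel.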

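    Change the script so that it handles a single URL passed as its first argument: drop the while loop from main, and replace the final call main "url_amendements_15.txt" with main "$1".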

    main() {
        local url=$1
        local sign
        local auteur_clean
        local cosign_clean
    
        sign=$url/signataires
        auteur_clean=$(auteur "$sign")
        cosign_clean=$(cosignataires "$sign")
        echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
    }
    

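    Then feed url_amendements_15.txt to parallel on standard input. parallel reads one URL per line and starts script.sh with that URL as its argument, running up to 90 jobs at once. The script must be executable (chmod +x script.sh) and on your PATH; otherwise call it as ./script.sh.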

    parallel -j90 script.sh < url_amendements_15.txt
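
Putting the pieces together, the per-URL script could look roughly like the sketch below. It is not the answer's exact code. It assumes the auteur and cosignataires logic from the question is kept as is and that the last line becomes main "$1". One further change is an assumption of this sketch rather than part of the answer: each row is printed to standard output instead of being appended to signataires_15.csv, so that 90 concurrent jobs do not all write to the same file and parallel (which groups each job's output by default) produces the CSV through a single redirect. The api variable and the argument guard are illustrative additions.

```shell
#!/bin/bash
# script.sh, per-URL variant (sketch). Takes one amendment URL as $1 and
# prints one CSV row to stdout. Printing instead of appending to the file
# is an assumption of this sketch.

# Base URL of the author resources (same endpoint as in the question).
api="https://www.parlapi.fr/rest/an/acteurs_amendements"

# Prints "resource URL,amendment ID,actor ID,full name" for the author.
auteur() {
    local url=$1 id
    id=$(curl -s "$url" | jq '.signataires[] | select(.relation=="auteur") | .id') \
    && curl -s "$api/$id" \
        | jq -r --arg url "$api/$id" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)' \
        | tr '\n' ',' | sed 's/,$//'
}

# Prints the co-signatories' full names, separated by semicolons.
cosignataires() {
    local url=$1 ids i
    ids=$(curl -s "$url" | jq '.signataires[] | select(.relation=="cosignataire") | .id') || return
    for i in $ids; do
        curl -s "$api/$i" | jq -r '(.acteur.prenom + " " + .acteur.nom)'
    done | tr '\n' ';' | sed 's/;$//'
}

# Builds one CSV row for a single amendment URL.
main() {
    local sign=$1/signataires
    printf '%s,%s\n' "$(auteur "$sign")" "$(cosignataires "$sign")"
}

# Run only when a URL is given, so the file can also be sourced without side effects.
if [ "$#" -gt 0 ]; then
    main "$1"
fi

# Usage (assumes GNU parallel is installed):
#   chmod +x script.sh
#   parallel -j90 ./script.sh < url_amendements_15.txt > signataires_15.csv
```

If the rows need to come out in the same order as the URL list, adding -k (--keep-order) to the parallel command should do that, at the cost of buffering output from jobs that finish early.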