bash awk processing-efficiency

How can I make a script that calls awk in a loop over k/v pairs faster?


I have a large number of text files that I would like to loop through. While looping, I would like to find lines that match a list of strings and extract each match to a separate folder. I have a variable "ij" that needs to be split into "i" and "j" to match two columns. For example, 2733 needs to be split into 27 and 33. The script searches each text file and extracts every line that has an i and j of 2733.

The problem here is that I have nearly 100 different strings, so it takes about 35 hours to get through all these strings.

Is there any way to extract all of the variables to separate files in just one loop? I am trying to loop through a text file, extract all the lines that are in my list of strings, output them to their own folder, and then move on to the next text file.

I am currently using the "awk" command to accomplish this.


list="2741 2740 2739 2738 2737 2641 2640 2639 2638 2541 2540 2539 2538 2441 2440 2439 2438 2341 2340 2339  2241 2240 2141" 

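for string in $list
do
    i=${string:0:2}   # first two digits, compared with column 3
    j=${string:2:2}   # last two digits, compared with column 2

    # one full pass over every text file per string
    awk -v i="$i" -v j="$j" '$2==j && $3==i {print $0}' "$datadir"/*.txt >"${fileout}${i}_${j}_Output.txt"
done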


Solution

  • So I did this:

    # for each 4 digits in the list
    # add "a[" and "];" before and after the four numbers
    # so awk array is "a[2741]; a[2740]; a[2739]; ...."
    awkarray=$(<<<"$list" sed -E 's/[0-9]{4}/a[&];/g')
    # key order follows the question: $3 holds the first two digits, $2 the last two
    awk -v fileout="$fileout" '
      BEGIN {'"$awkarray"'}
      ($3 $2) in a {
        print $0 > (fileout $3 "_" $2 "_Output.txt")
      }
    ' "$datadir"/*.txt
    

    First I transform the list so that it can be loaded as an array in awk. The array has only indexes; its elements have no values, so all I need is to check whether an index exists. Then I check whether the four-digit key formed by concatenating the two columns exists in the array; if it does, the line is redirected to the matching file name.
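To see what the sed step generates and how the `in` test behaves, here is a small sketch. It uses a shortened, made-up list of three keys, and the `probe` array is just an illustrative name for the values being looked up.

```shell
#!/usr/bin/env bash
# Shortened sample list (illustrative only; the real list is longer).
list="2741 2740 2739"

# Turn each 4-digit key into an awk statement that references an array index.
awkarray=$(sed -E 's/[0-9]{4}/a[&];/g' <<<"$list")
echo "$awkarray"

# Referencing a[key] in BEGIN creates the index; "in" then tests membership
# without creating new entries.
result=$(awk 'BEGIN {
  '"$awkarray"'
  n = split("2741 2733", probe, " ")
  for (k = 1; k <= n; k++)
    print probe[k], ((probe[k] in a) ? "listed" : "not listed")
}')
echo "$result"
```

The first `echo` shows the generated awk statements, and the second reports which of the two probed keys is present in the array.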

    Remember to quote your variables. $datadir/*.txt may not work when datadir contains spaces; use "$datadir"/*.txt instead. The newlines in the awk script above can be removed, so if you prefer a one-liner:

    awk -v fileout="$fileout" 'BEGIN {'"$(<<<"$list" sed -E 's/[0-9]{4}/a[&];/g')"'} ($3 $2) in a { print $0 > (fileout $3 "_" $2 "_Output.txt") }' "$datadir"/*.txt
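As a variation on the same single-pass idea, the list can be handed to awk as an ordinary variable and turned into a lookup set with split(), so no awk source is generated from shell data. The following is only a sketch on made-up sample data: the temporary directory, the file names `one.txt` and `two.txt`, and the awk names `keys`, `k`, and `want` are all assumptions for illustration. The column order follows the question's condition, where column 3 holds the first two digits and column 2 the last two.

```shell
#!/usr/bin/env bash
# Self-contained demo on made-up data; directory and file names are hypothetical.
workdir=$(mktemp -d)
datadir="$workdir/data"
fileout="$workdir/out/"
mkdir -p "$datadir" "$fileout"

# Sample rows: column 2 holds j, column 3 holds i (as in the question).
printf 'a 41 27 x\nb 33 27 y\nc 40 27 z\n' > "$datadir/one.txt"
printf 'd 41 27 w\ne 99 99 v\n'            > "$datadir/two.txt"

list="2741 2740"

# Pass the list as a plain awk variable and build the lookup set with split().
awk -v fileout="$fileout" -v keys="$list" '
  BEGIN {
    n = split(keys, k, " ")
    for (idx = 1; idx <= n; idx++) want[k[idx]]   # create index, no value
  }
  ($3 $2) in want {
    print > (fileout $3 "_" $2 "_Output.txt")
  }
' "$datadir"/*.txt

ls "$fileout"
cat "${fileout}27_41_Output.txt"
```

All input files are read once, and each matching line goes straight to the file named after its own key, so the run time no longer grows with the number of strings in the list.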