arrays bash performance awk comparison

Improving performance of comparing file column to array in bash


I am comparing rows in a series of files against an array of floats in bash. Briefly, the files in question (../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt) have the following structure:

    c    v    weight        ik        kx        ky        kz
    1    1   0.00000         1   0.00000   0.00000   0.00000
    1    1   0.00000         2   0.00000   0.04167   0.00000
    1    1   0.00000         3   0.00000   0.08333   0.00000
  

and the array is generated from ../vici_absdipole_noeh/v1c5.data, which has the following format:

kx      ky      kz          ik    ic    iv    is    ec (eV)         ev (eV)        eig (eV)   abs(dipole)^2      Re(dipole)      Im(dipole)
0.00000 0.00044 0.00000      1     1     1     1  0.11713703E+01 -0.12426462E+01  0.24140165E+01  0.69913425E-04  0.81359347E-02  0.19287282E-02
0.00000 0.01883 0.00000      2     1     1     1  0.11760590E+01 -0.12490846E+01  0.24251436E+01  0.59405512E-04 -0.70114501E-03  0.76755396E-02
0.00000 0.03722 0.00000      3     1     1     1  0.11746489E+01 -0.12612625E+01  0.24359113E+01  0.37648401E-04 -0.46637404E-02  0.39872204E-02
0.00000 0.05561 0.00000      4     1     1     1  0.11868220E+01 -0.12787400E+01  0.24655620E+01  0.18552618E-04 -0.21585915E-02  0.37273450E-02

My code compares the integers in the 4th column (ik) of each eigenvector file against my array, which is built from v1c5.data. If the 4th-column element of a row is in the "list" array, the file index and that row's weight, ik, kx and ky are echoed out.

Here's a sample of my working code, which compares 7485 files of 1152 lines each against an array of 1152 elements:

#!/bin/bash

#generates the array from information given in another file
for i in $(seq 2 1153)
do
    ik=$(awk -v i=$i 'NR==i''{ print$4 }'  ../vici_absdipole_noeh/v1c5.data)
    dp2=$(awk -v i=$i 'NR==i''{ print$11 }'  ../vici_absdipole_noeh/v1c5.data)
    dp2f=$(printf "%.8f" $dp2)
    if (( $(echo "$dp2f > 6" |bc -l) )); then
        list+=("$ik" )
    fi
done

echo "xct   ik  weight  kx  ky" > v1c5-high_dp2_kpts.txt

task(){
    echo working on $xct
    for line in {1..1152};do
        weight=$(awk -v line=$line 'NR==line''{ print$3 }'  ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        ik=$(awk -v line=$line 'NR==line''{ print$4 }'  ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        kx=$(awk -v line=$line 'NR==line''{ print$5 }'  ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        ky=$(awk -v line=$line 'NR==line''{ print$6 }'  ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        if [[ " ${list[@]} " =~ " ${ik} " ]]; then
            echo "$xct  $ik $weight $kx $ky" >> v1c5-high_dp2_kpts.txt
        fi
    done
}

for xct in {1..7485};do
((i=i%360)); ((i++==0)) && wait
task "$xct" &
done
wait

The code has run for more than 8 hours and has processed only 1700 files, which is rather slow. Is there a bottleneck in this code that is limiting its performance? If there is, how can I improve it?

I'm running this on a node of a high-performance computing center with 24 cores per node, so I've used parallelization as well to speed things up. Apparently this is still not enough.


Solution

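  • Your task function is really inefficient. On every pass through the loop it runs awk four times, and each run rereads all the matching files just to pull one field out of one line. That is 4 × 1152 = 4608 awk invocations for every value of xct. You can do all the work in one awk invocation.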

    task(){
        echo "working on $xct"
        # Pass the bash array to awk as one space-separated string.
        awk -v list="${list[*]}" -v xct="$xct" '
            BEGIN {
                # Split the string at whitespace into list_array, then use its
                # elements as keys of the associative array list_hash.
                n = split(list, list_array)
                for (i = 1; i <= n; i++) list_hash[list_array[i]] = 1
            }
            {
                weight = $3; ik = $4; kx = $5; ky = $6
                if (ik in list_hash) printf("%s  %s %s %s %s\n", xct, ik, weight, kx, ky)
            }' ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt >> v1c5-high_dp2_kpts.txt
    }
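
  • The loop that builds the array follows the same pattern: for each of the 1152 rows it starts awk twice, plus printf and bc. The sketch below shows one possible way to get the same list in a single awk pass and then feed it to a one-pass filter like the one above. It is only an illustration under stated assumptions: the temporary directory, the two small stand-in files, and the numbers marked "made up" are hypothetical, and the header line is assumed to be the only line to skip.

```shell
# Sketch only: file names and the rows marked "made up" are for the demo.
tmp=$(mktemp -d)

# Stand-in for v1c5.data (header simplified; column 4 = ik, column 11 = abs(dipole)^2).
cat > "$tmp/v1c5.data" <<'EOF'
kx ky kz ik ic iv is ec ev eig abs2 re im
0.00000 0.00044 0.00000 1 1 1 1 0.11713703E+01 -0.12426462E+01 0.24140165E+01 0.69913425E-04 0.81359347E-02 0.19287282E-02
0.00000 0.01883 0.00000 2 1 1 1 0.11760590E+01 -0.12490846E+01 0.24251436E+01 0.75000000E+01 -0.70114501E-03 0.76755396E-02
0.00000 0.03722 0.00000 3 1 1 1 0.11746489E+01 -0.12612625E+01 0.24359113E+01 0.12000000E+02 -0.46637404E-02 0.39872204E-02
EOF
# (rows 2 and 3 above carry made-up abs(dipole)^2 values so that they pass the threshold)

# Stand-in for one *-5_1-sorted.txt file (column 3 = weight, 4 = ik, 5 = kx, 6 = ky).
# The weights are made up.
cat > "$tmp/sorted.txt" <<'EOF'
c v weight ik kx ky kz
1 1 0.00000 1 0.00000 0.00000 0.00000
1 1 0.10000 2 0.00000 0.04167 0.00000
1 1 0.20000 3 0.00000 0.08333 0.00000
EOF

# One awk pass instead of the per-row awk/printf/bc loop: skip the header,
# force a numeric comparison, and collect ik for rows above the threshold
# into one space-separated string.
list=$(awk -v thr=6 '
    NR > 1 && ($11 + 0) > (thr + 0) { printf "%s%s", sep, $4; sep = " " }
' "$tmp/v1c5.data")

# One awk pass over the eigenvector file: load the ik values as hash keys,
# then print every row whose ik is one of them.
xct=1
result=$(awk -v list="$list" -v xct="$xct" '
    BEGIN { n = split(list, a); for (i = 1; i <= n; i++) keep[a[i]] = 1 }
    ($4 in keep) { printf("%s  %s %s %s %s\n", xct, $4, $3, $5, $6) }
' "$tmp/sorted.txt")

echo "$result"
rm -rf "$tmp"
```

    Here the list is kept as a plain space-separated string because that is the form awk receives through -v anyway. If a bash array is still wanted elsewhere in the script, the same awk output could be read into one, and "${list[*]}" would turn it back into a string for awk.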