I am comparing rows in a series of files against an array of integers in bash. Briefly, the files in question (../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt) have the following structure:
c v weight ik kx ky kz
1 1 0.00000 1 0.00000 0.00000 0.00000
1 1 0.00000 2 0.00000 0.04167 0.00000
1 1 0.00000 3 0.00000 0.08333 0.00000
and the array is generated from ../vici_absdipole_noeh/v1c5.data, which has the following format:
kx ky kz ik ic iv is ec (eV) ev (eV) eig (eV) abs(dipole)^2 Re(dipole) Im(dipole)
0.00000 0.00044 0.00000 1 1 1 1 0.11713703E+01 -0.12426462E+01 0.24140165E+01 0.69913425E-04 0.81359347E-02 0.19287282E-02
0.00000 0.01883 0.00000 2 1 1 1 0.11760590E+01 -0.12490846E+01 0.24251436E+01 0.59405512E-04 -0.70114501E-03 0.76755396E-02
0.00000 0.03722 0.00000 3 1 1 1 0.11746489E+01 -0.12612625E+01 0.24359113E+01 0.37648401E-04 -0.46637404E-02 0.39872204E-02
0.00000 0.05561 0.00000 4 1 1 1 0.11868220E+01 -0.12787400E+01 0.24655620E+01 0.18552618E-04 -0.21585915E-02 0.37273450E-02
What my code does is compare the integers in the 4th column of each sorted file against my array, which is built from the 4th column of v1c5.data; if a row's 4th-column value is in the "list" array, then the file index, weight, ik, kx and ky of that row are echoed out.
Here's a sample of my working code, which compares 7485 files of 1152 lines each to an array of 1152 elements:
#!/bin/bash
# generates the array from information given in another file
for i in $(seq 2 1153); do
    ik=$(awk -v i="$i" 'NR == i { print $4 }' ../vici_absdipole_noeh/v1c5.data)
    dp2=$(awk -v i="$i" 'NR == i { print $11 }' ../vici_absdipole_noeh/v1c5.data)
    dp2f=$(printf "%.8f" "$dp2")
    if (( $(echo "$dp2f > 6" | bc -l) )); then
        list+=("$ik")
    fi
done
echo "xct ik weight kx ky" > v1c5-high_dp2_kpts.txt
task(){
    echo "working on $xct"
    for line in {1..1152}; do
        weight=$(awk -v line="$line" 'NR == line { print $3 }' ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        ik=$(awk -v line="$line" 'NR == line { print $4 }' ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        kx=$(awk -v line="$line" 'NR == line { print $5 }' ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        ky=$(awk -v line="$line" 'NR == line { print $6 }' ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt)
        if [[ " ${list[@]} " =~ " ${ik} " ]]; then
            echo "$xct $ik $weight $kx $ky" >> v1c5-high_dp2_kpts.txt
        fi
    done
}
for xct in {1..7485}; do
    ((i=i%360)); ((i++==0)) && wait   # wait for the batch to finish after every 360 background jobs
    task "$xct" &
done
wait
The code has been running for over 8 hours and has only processed 1700 files, which is rather slow. Is there a bottleneck in this code that is limiting its performance, and if so, how can I improve it?
I'm running this on a high-performance computing node with 24 cores, hence the parallelization to speed things up, but apparently that is still not enough.
Your task function is really inefficient. It re-reads all the files 4 times on each pass through the loop, just to process one line. You can do all the work in a single awk invocation:
task(){
    echo "working on $xct"
    cat ../summarize_eigenvectors/"$xct"_*/*-5_1-sorted.txt |
    awk -v list="${list[*]}" -v xct="$xct" '
        BEGIN {
            split(list, list_array)                              # split the string "list" into array list_array on whitespace
            for (i in list_array) list_hash[list_array[i]] = 1   # associative array with the $list elements as keys
        }
        {
            weight = $3; ik = $4; kx = $5; ky = $6
            if (ik in list_hash) printf("%s %s %s %s %s\n", xct, ik, weight, kx, ky)
        }' >> v1c5-high_dp2_kpts.txt
}
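The same single-pass idea applies to the loop that builds list at the top of the script, which currently launches two awk processes and one bc process for every one of the 1152 data rows of v1c5.data. A minimal sketch, assuming bash 4+ (for mapfile) and keeping the original row range (2 to 1153) and the threshold of 6 on column 11:

# Build "list" in one pass over v1c5.data: keep column 4 (ik) of every
# data row whose column 11, abs(dipole)^2, exceeds the threshold.
mapfile -t list < <(awk 'NR >= 2 && NR <= 1153 && $11 + 0 > 6 { print $4 }' \
    ../vici_absdipole_noeh/v1c5.data)

Adding 0 to $11 makes awk treat the E-notation value as a number, which replaces the printf/bc round trip in the original loop. The rest of the script, including the throttled background loop over xct, can stay as it is.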