bash, performance, shell, sed, large-files

Apply sed only to the part of the file after the last match in a loop - shell / bash


I have a couple of large files (~1 GB) with the following structure:

fooA iug9wa
fooA lauie
fooA nwgoieb
fooB wilgb
fooB rqgebepu
fooB ifbqeiu
...
fooN ibfiygb
fooN yvsiy
fooN aeviu

I would like to replace, in the shell, each fooX (the names contain letters, numbers, "." and "_"; they are all listed in foo.list) with its sequential number from 1 to N.
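
So, assuming foo.list contains fooA, fooB, ..., fooN in that order, the files should end up looking like this:

1 iug9wa
1 lauie
1 nwgoieb
2 wilgb
2 rqgebepu
2 ifbqeiu
...
N ibfiygb
N yvsiy
N aeviu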

I've used:

nfoos=$(wc -l < foo.list)                  # number of foos = number of lines in foo.list

for i in $(seq 1 $nfoos)
do
    currentfoo=$(sed "${i}q;d" foo.list)   # print line i of foo.list, then quit
    sed -i "s/${currentfoo}/$i/g" file1    # replace every occurrence, in place
    sed -i "s/${currentfoo}/$i/g" file2
    sed -i "s/${currentfoo}/$i/g" filen
done

However, with files this large it has been taking forever. Since each consecutive fooX always appears in the files after foo(X-1), I thought I could make sed search only the part of each file after the last match of the previous foo, so that there is less and less text to search with each foo. I've been trying to use labels and some multiline approaches, but the syntax keeps beating me here.
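
Roughly, this is the idea I had in mind (just an untested sketch of it; it would still rewrite the whole file on every sed -i pass):

start=1
for i in $(seq 1 $nfoos)
do
    currentfoo=$(sed "${i}q;d" foo.list)
    # restrict the substitution to lines $start through the end of the file
    sed -i "${start},\$ s/${currentfoo}/${i}/g" file1
    # the next foo can only start after the last line we just renumbered
    last=$(grep -n "^${i} " file1 | tail -n 1 | cut -d: -f1)
    [ -n "$last" ] && start=$((last + 1))
done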

Does anyone know how to make this work? (It doesn't necessarily have to use sed, but it would be great if it worked in plain Bash.)

I appreciate any help. And if you do help, please explain each function/option/variable used, so that I can figure out where I was messing up.


Solution

  • You can use awk.
    The first part of the awk command below fills the array a, mapping each foo name to its line number in foo.list; the second part replaces the first word of every data line with that number and prints the line.

    awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list file1
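
    Since you asked for each piece to be explained, here is the same command written out with comments (the field layout is assumed from your sample):

    awk '
        NR==FNR {            # true only while reading the first file, foo.list
            a[$1] = NR       # remember: this foo name -> its line number (1..N)
            next             # do not run the second block for foo.list lines
        }
        $1 in a {            # a data line whose first word is a known foo
            $1 = a[$1]       # replace the foo name with its number
            print            # print the modified line
        }
    ' foo.list file1

    Note that assigning to $1 makes awk rebuild the line with single spaces between the fields, and that lines whose first word is not in foo.list are not printed at all.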
    

    When this output is what you want, you can loop over your files:

    for f in file1 file2 filen; do
      awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list "${f}" > "${f}.tmp" &&
      mv "${f}.tmp" "${f}"
    done
    

    The && makes sure the new file only replaces the original file when awk exited successfully.
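
    With ~1 GB files you may want to do a dry run on one file first and inspect the result before letting the loop overwrite anything (file1.check is just a scratch name):

    awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list file1 > file1.check
    head file1.check          # do the first lines look right?
    wc -l file1 file1.check   # the counts match only if every prefix was found in foo.list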