AWK: Concatenate and process three or more files with a method similar to the FNR==NR approach


While learning awk, I found that the FNR==NR approach is a very common way to process two files. If FNR==NR, awk is reading the first file. When FNR resets to 1 at the start of the next file, NR keeps counting across the concatenated input, so !(FNR==NR) holds and the line must come from the second file.
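
For reference, the two-file idiom I mean looks roughly like this. It is only a minimal sketch: keys.txt and data.txt are made-up files for illustration, not my real data.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'A1\nA4\n' > keys.txt                   # hypothetical first file: keys to look up
printf 'X|A1|Z\nX|A2|Z\nX|A4|Z\n' > data.txt   # hypothetical second file

out=$(awk -v FS='[|]' '
    FNR == NR { seen[$1]; next }   # first file: NR and FNR still agree
    $2 in seen                     # second file: FNR restarted, NR did not
' keys.txt data.txt)
printf '%s\n' "$out"
```

This prints the two data.txt lines whose second field is A1 or A4.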

With three or more files I can't see how to tell the second file from the third, because both satisfy the same !(FNR==NR) condition. This made me wonder whether there could be something like FNR2 and FNR3.

So I implemented a method to process three files in one awk call, as if there were an FNR1, FNR2 and FNR3, one for each file. For every file I wrote a for loop that runs separately. The condition has the same form in every loop, NR==FNR#, and I actually get what I expected.

So I wonder whether there are more sober, concise methods that deliver results similar to the awk code below.

Sample File Contents

$ cat file1
X|A1|Z
X|A2|Z
X|A3|Z
X|A4|Z
$ cat file2
X|Y|A3
X|Y|A4
X|Y|A5
$ cat file3
A1|Y|Z
A4|Y|Z

AWK for loop

$ cat fnrarray.sh
awk -v FS='[|]' '{ for(i=FNR ; i<=NR && i<=FNR && NR==FNR; i++)         {x++; print "NR:",NR,"FNR1:",i,"FNR:",FNR,"\tfirst file\t"}
                   for(i=FNR ; i+x<=NR && i<=FNR && NR==FNR+x; i++)     {y++; print "NR:",NR,"FNR2:",i+x,"FNR:",FNR,"\tsecond file\t"}
                   for(i=FNR ; i+x+y<=NR && i<=FNR && NR==FNR+x+y; i++) {print "NR:",NR,"FNR3:",i+x+y,"FNR:",FNR,"\tthird file\t"}
}' file1 file2 file3 

Current and desired output

$ sh fnrarray.sh
NR: 1 FNR1: 1 FNR: 1    first file  
NR: 2 FNR1: 2 FNR: 2    first file  
NR: 3 FNR1: 3 FNR: 3    first file  
NR: 4 FNR1: 4 FNR: 4    first file  
NR: 5 FNR2: 5 FNR: 1    second file 
NR: 6 FNR2: 6 FNR: 2    second file 
NR: 7 FNR2: 7 FNR: 3    second file 
NR: 8 FNR3: 8 FNR: 1    third file  
NR: 9 FNR3: 9 FNR: 2    third file

You can see that NR lines up with FNR#, so it is easy to read which NR belongs to which file.


Another Method

I found this method, FNR==1{++f} f==1 {}, in the question "Handling 3 Files using awk".

But with this method arr1[1] is replaced every time a new line is read.

Failed attempt

$ awk -v FS='[|]' 'FNR==1{++f} f==1 {split($2,arr1); print arr1[1]}' file1 file2 file3 
A1
A2
A3
A4

Success with for loop (arr1[1] is not changed)

$ awk -v FS='[|]' '{for(i=FNR ; i<=NR && i<=FNR && NR==FNR; i++) {arr1[++k]=$2; print arr1[1]}}' file1 file2 file3 
A1
A1
A1
A1
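
The difference between the two attempts seems to come from split(), which rebuilds its target array on every record, rather than from the FNR==1{++f} pattern itself. A minimal sketch, assuming the same appending assignment arr1[++k]=$2 is used under the counter condition (the sample files are recreated in a temporary directory so the snippet runs on its own):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'X|A1|Z\nX|A2|Z\nX|A3|Z\nX|A4|Z\n' > file1
printf 'X|Y|A3\nX|Y|A4\nX|Y|A5\n' > file2
printf 'A1|Y|Z\nA4|Y|Z\n' > file3

out=$(awk -v FS='[|]' '
    FNR == 1 { ++f }                            # a new input file has started
    f == 1   { arr1[++k] = $2; print arr1[1] }  # append, so earlier elements stay
' file1 file2 file3)
printf '%s\n' "$out"
```

With the counter in place of the for loop, arr1[1] stays A1 for all four lines of file1.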


Solution

  • To identify files in order using GNU awk, regardless of empty or repeated files:

    awk '
        ARGIND == 1 { do 1st file stuff }
        ARGIND == 2 { do 2nd file stuff }
        ARGIND == 3 { do 3rd file stuff }
    ' file1 file2 file3
    

    e.g. to get the text under "output" in your question from the 3 sample input files you provided:

    awk '
        ARGIND == 1 { pos = "first" }
        ARGIND == 2 { pos = "second" }
        ARGIND == 3 { pos = "third" }
        { print "NR:", NR, "FNR" ARGIND ":", NR, "FNR:", FNR, pos " file" }
    ' file1 file2 file3
    NR: 1 FNR1: 1 FNR: 1 first file
    NR: 2 FNR1: 2 FNR: 2 first file
    NR: 3 FNR1: 3 FNR: 3 first file
    NR: 4 FNR1: 4 FNR: 4 first file
    NR: 5 FNR2: 5 FNR: 1 second file
    NR: 6 FNR2: 6 FNR: 2 second file
    NR: 7 FNR2: 7 FNR: 3 second file
    NR: 8 FNR3: 8 FNR: 1 third file
    NR: 9 FNR3: 9 FNR: 2 third file
    

    or using any awk, if all file names are unique, whether or not any of the files are empty:

    awk '
        FILENAME == ARGV[1] { do 1st file stuff }
        FILENAME == ARGV[2] { do 2nd file stuff }
        FILENAME == ARGV[3] { do 3rd file stuff }
    ' file1 file2 file3
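
As a rough illustration of the FILENAME == ARGV[n] form, here is a sketch that fills in the three actions with a made-up task: print the lines of file3 whose first field appears both as a 2nd field in file1 and as a 3rd field in file2. The arrays a and both, and the task itself, are assumptions for illustration and not something from your question. The sample files are recreated in a temporary directory so the snippet runs on its own.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'X|A1|Z\nX|A2|Z\nX|A3|Z\nX|A4|Z\n' > file1
printf 'X|Y|A3\nX|Y|A4\nX|Y|A5\n' > file2
printf 'A1|Y|Z\nA4|Y|Z\n' > file3

out=$(awk -v FS='[|]' '
    FILENAME == ARGV[1] { a[$2]; next }                  # file1: remember each 2nd field
    FILENAME == ARGV[2] { if ($3 in a) both[$3]; next }  # file2: keep 3rd fields also seen in file1
    FILENAME == ARGV[3] && ($1 in both)                  # file3: print lines whose 1st field is in both
' file1 file2 file3)
printf '%s\n' "$out"
```

Only A4 occurs in all three files, so the single line A4|Y|Z is printed.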
    

    or, if none of the files are empty, whether their names are unique or not (note file1 appears twice in the arg list):

    awk '
        FNR == 1 { argind++ }
        argind == 1 { do 1st file stuff }
        argind == 2 { do 2nd file stuff }
        argind == 3 { do 3rd file stuff }
    ' file1 file2 file1
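
A small sketch of that counter in use, assuming a made-up two-pass task: the first pass over file1 counts its lines, and the second pass prints each line with its position out of the total. The variable total and the output format are illustrative choices only, and the sample files are recreated in a temporary directory so the snippet runs on its own.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'X|A1|Z\nX|A2|Z\nX|A3|Z\nX|A4|Z\n' > file1
printf 'X|Y|A3\nX|Y|A4\nX|Y|A5\n' > file2

out=$(awk '
    FNR == 1    { argind++ }                      # a non-empty file has just started
    argind == 1 { total = FNR; next }             # 1st pass over file1: count its lines
    argind == 2 { next }                          # file2: ignored in this sketch
    argind == 3 { print FNR "/" total ": " $0 }   # 2nd pass over file1
' file1 file2 file1)
printf '%s\n' "$out"
```

FILENAME could not tell the two passes over file1 apart here, which is why the counter is used.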
    

    If a file name can appear multiple times in the arg list and some of the files could be empty, then it becomes trickier with a non-GNU awk, which is why GNU awk has ARGIND. For example, something like this (untested):

    awk '
        BEGIN {
            for (i=1; i<ARGC; i++) {
                fname = ARGV[i]
                if ( (getline line < fname) > 0 ) {
                    # file is not empty so save its position in the args
                    # list in an array indexed by its name and the number
                    # of times that name has been seen so far
                    arginds[fname,++tmpcnt[fname]] = i
                }
                close(fname)
            }
        }
        FNR == 1 { argind = arginds[FILENAME,++cnt[FILENAME]] }
        argind == 1 { do 1st file stuff }
        argind == 2 { do 2nd file stuff }
        argind == 3 { do 3rd file stuff }
    ' file1 file2 file1
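
One way to exercise that last script is sketched below, under the assumption of two hypothetical files: f1 (two lines) and empty (zero bytes), with f1 listed twice and the empty file between the two occurrences. The per-file actions are replaced by a single print of argind so the mapping is visible.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a\nb\n' > f1    # hypothetical non-empty file
: > empty               # hypothetical empty file

out=$(awk '
    BEGIN {
        for (i = 1; i < ARGC; i++) {
            fname = ARGV[i]
            if ((getline line < fname) > 0)
                arginds[fname, ++tmpcnt[fname]] = i   # position of each non-empty file
            close(fname)
        }
    }
    FNR == 1 { argind = arginds[FILENAME, ++cnt[FILENAME]] }
    { print argind ": " $0 }                          # show which arg each line came from
' f1 empty f1)
printf '%s\n' "$out"
```

The second pass over f1 is reported as argument 3, whereas the plain FNR == 1 { argind++ } counter would call it 2 because the empty file never produces a record.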