Tags: awk, gawk, nawk

How to handle 3 files with awk?


OK, so after spending 2 days on this, I am still not able to solve it, and I am almost out of time. It might be a very silly question, so please bear with me. My awk script does something like this:

BEGIN { n = 50; i = n }
FNR == NR {
    # Read file-1, which has just 1 column
    ids[$1] = int(i++/n)
    next
}
{
    # Read file-2, which has 4 columns
    # Do something
    next
}
END {...}

It works fine. But now I want to extend it to read 3 files. Let's say that, instead of hard-coding the value of "n", I need to read a properties file and set the value of "n" from it. I found this question and have tried something like this:

BEGIN { n = 0; i = 0 }
FNR == NR {
    # Block A
    # Try to read file-0
    next
}
{
    # Block B
    # Read file-1, which has just 1 column
    next
}
{
    # Block C
    # Read file-2, which has 4 columns
    # Do something
    next
}
END {...}

But it is not working. Block A is executed for file-0, and I am able to read the property from the properties file. But Block B is executed for both file-1 and file-2, and Block C is never executed.

Can someone please help me solve this? I have never used awk before and the syntax is very confusing. Also, if someone can explain how awk reads input from different files, that will be very helpful.

Please let me know if I need to add more details to the question.


Solution

  • Update: The solution below works, as long as all input files are nonempty, but see @Ed Morton's answer for a simpler and more robust way of adding file-specific handling.

    However, this answer still provides a hopefully helpful explanation of some awk basics and why the OP's approach didn't work.


    Try the following (note that I've made the indices 1-based, as that's how awk does it):

    awk '
    
     # Increment the current-file index, if a new file is being processed.
     FNR == 1 { ++fIndex }
    
     # Process current line if from 1st file.
     fIndex == 1 {
        print "file 1: " FILENAME
        next
     }
    
     # Process current line if from 2nd file.
     fIndex == 2 {
        print "file 2: " FILENAME
        next
     }
    
     # Process current line (from all remaining files).
     {
        print "file " fIndex ": " FILENAME
     }
    
    ' file-1 file-2 file-3
    
    • Pattern FNR==1 is true whenever a new input file is starting to get processed (FNR contains the input file-relative line number).
    • Every time a new file starts processing, fIndex is incremented and thus reflects the 1-based index of the current input file. Tip of the hat to @twalberg's helpful answer.

      • Note that an uninitialized awk variable used in a numeric context defaults to 0, so there's no need to initialize fIndex (unless you want a different start value).
    • Patterns such as fIndex == 1 can then be used to execute blocks for lines from a specific input file only (assuming the block ends in next).
    • The last block is then executed for all input files that don't have file-specific blocks (above).

    As for why your approach didn't work:

    • Your 2nd and 3rd blocks are potentially executed unconditionally, for lines from all input files, because they are not preceded by a pattern (condition).

    • So your 2nd block is entered for lines from all subsequent input files, and its next statement then prevents the 3rd block from ever getting reached.
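
To see this behavior in isolation, here is a stripped-down version of the three-block program from the question, run against three one-line files (the file contents are placeholders). Each block just reports which file it was entered for.

```shell
tmp=$(mktemp -d)
printf 'p\n' > "$tmp/file-0"
printf 'x\n' > "$tmp/file-1"
printf 'y\n' > "$tmp/file-2"

out=$(cd "$tmp" && awk '
  FNR == NR { print "A: " FILENAME; next }   # true only while reading the 1st file
            { print "B: " FILENAME; next }   # no pattern: runs for every remaining line
            { print "C: " FILENAME; next }   # never reached, because B always calls next
' file-0 file-1 file-2)

printf '%s\n' "$out"
rm -rf "$tmp"
```

The output has one "A:" line for file-0 and "B:" lines for both file-1 and file-2; "C:" never appears.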

    Potential misconceptions:

    • Perhaps you think that each block functions as a loop processing a single input file. This is NOT how awk works. Instead, the entire awk program is processed in a loop, with each iteration processing a single input line, starting with all lines from file 1, then from file 2, ...

    • An awk program can have any number of blocks (typically preceded by patterns), and whether they're executed for the current input line is solely governed by whether the pattern evaluates to true; if there is no pattern, the block is executed unconditionally (across input files). However, as you've already discovered, next inside a block can be used to skip subsequent blocks (pattern-block pairs).
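
A small sketch of the points above, using three input lines from a pipe instead of files: every block is considered for every line, in order, and next abandons the remaining blocks for the current line only.

```shell
out=$(printf '1\n2\n3\n' | awk '
  { printf "line %s:", $0 }             # no pattern: runs for every input line
  $1 == 2 { print " skipped"; next }    # next: skip the blocks below for this line
  $1 % 2  { printf " odd" }             # pattern is true for odd numbers only
  { print " done" }                     # no pattern: runs unless next was called
')

printf '%s\n' "$out"
```

Lines 1 and 3 pass through the first, third, and fourth blocks; line 2 is cut short by next in the second block.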