Search code examples
bashawkdo-while

Awk in do-while loop, read files with multiple columns


I'm writing a script which reads variables from the columns of one file, pattern matches with a column of another file, and prints whole lines from this other file to an output file. I think I've come close to a solution, but only column names in the file outputs are produced (NR==1 is written, nothing else).

while read -r value1 value2
do
        echo "$value2"
        awk -F "\t" 'NR==1; $2 == "$value1" { print $0 }' data.txt > "${value2}".out
done < "var_table.txt" 

Here's an example head of my data:

data.txt

V1  V2  V3  V4
1   dat-131.2   2071.49 3.11E-302
2   dat-219.1   744.48  3.11E-302
3   dat-120.2   13.29   3.11E-302
4   dat-577.1   1.71    3.11E-302
5   dat-535.2   16.07   3.11E-302
6   dat-535.2   12.99   3.11E-302
7   dat-535.2   2147.31 3.11E-302
8   dat-535.2   61.21   3.11E-302
…
    

var_table.txt

dat-501.16  VarA
dat-151.5   VarB
dat-518.9   VarC
dat-535.2   VarD
dat-49.0    VarE

The expected output would be to create a tsv file called VarD.out ("${value2}.out), populated with a header (NR==1), plus whole rows from data.txt corresponding to the value "dat-535.2" in column 2 (print $0 if $2 == "$value1").

VarD.out _|>

V1  V2  V3  V4
5   dat-535.2   16.07   3.11E-302
6   dat-535.2   12.99   3.11E-302
7   dat-535.2   2147.31 3.11E-302
8   dat-535.2   61.21   3.11E-302

I hope this makes sense and is somewhat replicable. I've looked at other answers but am not able to understand how to incorporate multiple columns of data. My guess is the problem is variable assignment because of multiple files, as $2 and $0 correspond to data from data.txt, and value1 and value2 refer to columns of var_table.txt? Any help is welcome.

Thanks


Solution

  • Fixing OP's code to properly pass a bash variable to awk:

    while read -r value1 value2
    do
            echo "$value2"
            awk -F "\t" -v value="$value1" 'NR==1 || $2==value' data.txt > "${value2}".out
    done < "var_table.txt"
    

    Where:

    • -v value="$value1" - assigns the value of the bash variable value1 to the awk varaible value
    • NR==1 || $2==value - if first record of input file, or 2nd field equals the awk variable value (aka bash variable value1) then write to output file

    This generates:

    $ head Var*out
    ==> VarA.out <==
    V1      V2      V3      V4
    
    ==> VarB.out <==
    V1      V2      V3      V4
    
    ==> VarC.out <==
    V1      V2      V3      V4
    
    ==> VarD.out <==
    V1      V2      V3      V4
    5       dat-535.2       16.07   3.11E-302
    6       dat-535.2       12.99   3.11E-302
    7       dat-535.2       2147.31 3.11E-302
    8       dat-535.2       61.21   3.11E-302
    
    ==> VarE.out <==
    V1      V2      V3      V4
    

    A more efficient method where we pull the loop into awk:

    awk '
    BEGIN      { FS=OFS="\t" }
    FNR==NR    { vars[$1]=$2; next }                 # 1st file: save contents to array vars[] using 1st field as the index
    FNR==1     { for (i in vars)                     # 2nd file: write header row to each of our output files (ie, for each index from the vars[] array)
                     print $0 > (vars[i] ".out")
                 next
               }
    $2 in vars { print $0 > (vars[$2] ".out") }      # 2nd file: if 2nd field is an index in array vars[] then print the current line to the associated output file
    ' var_table.txt data.txt
    

    NOTES:

    • this approach will keep open a separate output file descriptor for each row in var_table.txt
    • for a sufficiently 'large' number of rows this could cause some versions of awk to abort due to running out of file descriptors
    • there are a few ways to address this issue (running out of file descriptors) but will require more coding and a better understanding of the input data, eg, will all rows in data.txt (with the same value in the 2nd field) always reside on consecutive lines in data.txt? or could they be spread throughout the file?

    This also generates:

    ==> VarA.out <==
    V1      V2      V3      V4
    
    ==> VarB.out <==
    V1      V2      V3      V4
    
    ==> VarC.out <==
    V1      V2      V3      V4
    
    ==> VarD.out <==
    V1      V2      V3      V4
    5       dat-535.2       16.07   3.11E-302
    6       dat-535.2       12.99   3.11E-302
    7       dat-535.2       2147.31 3.11E-302
    8       dat-535.2       61.21   3.11E-302
    
    ==> VarE.out <==
    V1      V2      V3      V4