I'm writing a script which reads variables from the columns of one file, pattern matches with a column of another file, and prints whole lines from this other file to an output file. I think I've come close to a solution, but only column names in the file outputs are produced (NR==1 is written, nothing else).
while read -r value1 value2
do
echo "$value2"
awk -F "\t" 'NR==1; $2 == "$value1" { print $0 }' data.txt > "${value2}".out
done < "var_table.txt"
Here's an example head of my data:
data.txt
V1 V2 V3 V4
1 dat-131.2 2071.49 3.11E-302
2 dat-219.1 744.48 3.11E-302
3 dat-120.2 13.29 3.11E-302
4 dat-577.1 1.71 3.11E-302
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
…
var_table.txt
dat-501.16 VarA
dat-151.5 VarB
dat-518.9 VarC
dat-535.2 VarD
dat-49.0 VarE
The expected output would be to create a tsv file called VarD.out ("${value2}.out), populated with a header (NR==1), plus whole rows from data.txt corresponding to the value "dat-535.2" in column 2 (print $0 if $2 == "$value1").
VarD.out _|>
V1 V2 V3 V4
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
I hope this makes sense and is somewhat replicable. I've looked at other answers but am not able to understand how to incorporate multiple columns of data. My guess is the problem is variable assignment because of multiple files, as $2 and $0 correspond to data from data.txt, and value1 and value2 refer to columns of var_table.txt? Any help is welcome.
Thanks
Fixing OP's code to properly pass a bash
variable to awk
:
while read -r value1 value2
do
echo "$value2"
awk -F "\t" -v value="$value1" 'NR==1 || $2==value' data.txt > "${value2}".out
done < "var_table.txt"
Where:
-v value="$value1"
- assigns the value of the bash
variable value1
to the awk
varaible value
NR==1 || $2==value
- if first record of input file, or 2nd field equals the awk
variable value
(aka bash
variable value1
) then write to output fileThis generates:
$ head Var*out
==> VarA.out <==
V1 V2 V3 V4
==> VarB.out <==
V1 V2 V3 V4
==> VarC.out <==
V1 V2 V3 V4
==> VarD.out <==
V1 V2 V3 V4
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
==> VarE.out <==
V1 V2 V3 V4
A more efficient method where we pull the loop into awk
:
awk '
BEGIN { FS=OFS="\t" }
FNR==NR { vars[$1]=$2; next } # 1st file: save contents to array vars[] using 1st field as the index
FNR==1 { for (i in vars) # 2nd file: write header row to each of our output files (ie, for each index from the vars[] array)
print $0 > (vars[i] ".out")
next
}
$2 in vars { print $0 > (vars[$2] ".out") } # 2nd file: if 2nd field is an index in array vars[] then print the current line to the associated output file
' var_table.txt data.txt
NOTES:
var_table.txt
awk
to abort due to running out of file descriptorsdata.txt
(with the same value in the 2nd field) always reside on consecutive lines in data.txt
? or could they be spread throughout the file?This also generates:
==> VarA.out <==
V1 V2 V3 V4
==> VarB.out <==
V1 V2 V3 V4
==> VarC.out <==
V1 V2 V3 V4
==> VarD.out <==
V1 V2 V3 V4
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
==> VarE.out <==
V1 V2 V3 V4