I want to extract all lines in ids.ped that match a list of words (the second column of list_of_words), preserving the order of that list.
ids.ped file:
2425 NA19901 0
2472 NA20291 0
2476 NA20298 0
1328 NA06989 0
...
I want to use awk and parallel for this task.
I tried the following:
cut -f2 list_of_words |
parallel -j35 --keep-order \
awk -v id={} 'BEGIN{FS=" "}{if($2 == id){print $2,$3}}' ids.ped
However, I get the error
/bin/bash: -c: line 0: syntax error near unexpected token `('
/bin/bash: -c: line 0: `awk -v id= BEGIN{FS=" "}{if($2 == id){print $2,$3}} ids.ped'
It seems I cannot pass {} this way.
Notes:
ids.ped is big, which is why I want to parallelize awk, since I want to extract lines according to the second column of ids.ped.
For some reason I do not understand, grep -w extracts some lines twice; that is one reason I would rather use awk.
Any other answer to solve this problem efficiently is welcome. Thanks.
I wasn't able to reproduce your parameter passing problem (do you have empty columns at the beginning of the file?), but I did get the syntax error, which is due to how parallel interprets its arguments.
/opt/local/bin/bash: -c: line 0: syntax error near unexpected token `('
/opt/local/bin/bash: -c: line 0: `awk -v id=NA20291 BEGIN{FS=" "}{if($2 == id){print $2,$3}} foo.txt'
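To see where the quotes go, here is a small sketch that does not need parallel at all. It assumes, as the parallel manual describes for the default (non -q) mode, that the arguments are joined with spaces and handed to a shell for a second round of parsing. The variable names joined and status are only for illustration.

```shell
# The calling shell removes the single quotes, so the awk program
# arrives as one bare argument with no quoting left around it.
set -- awk -v id=NA20291 'BEGIN{FS=" "}{if($2 == id){print $2,$3}}' ids.ped

# Join the arguments with spaces, roughly what happens before the
# command line is passed to a second shell.
joined="$*"
printf '%s\n' "$joined"

# Parsing that string again fails at the unquoted "(" before awk ever runs.
if sh -c "$joined" 2>/dev/null; then
    status=ok
else
    status=failed
fi
printf 'second parse: %s\n' "$status"
```

The printed command line has the same shape as the one in the error message: the awk program is no longer quoted, so the second shell trips over the parenthesis.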
You've got three choices to fix the problem. You can add the -q option to parallel to "protect against evaluation by the subshell":
cut -f2 list_of_words |
parallel -j35 -q --keep-order \
awk -v id="{}" 'BEGIN{FS=" "}{if($2 == id){print $2,$3}}' ids.ped
You can move the awk code to a separate file; the rest of the command is simple enough that it doesn't need to be escaped:
cut -f2 list_of_words |
parallel -j35 --keep-order awk -v id={} -f foo.awk ids.ped
Contents of foo.awk:
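#!/usr/bin/awk -f
# Print columns 2 and 3 of every line whose second column equals id
# (id is supplied on the command line with -v id=...).
BEGIN {
    FS = " "
}
{
    if ($2 == id) {
        print $2, $3
    }
}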
Or, you can figure out how to escape the command. The manual linked above says "most people will never need more quoting than putting '\' in front of the special characters."
cut -f2 list_of_words |
parallel -j35 --keep-order \
awk -v id="{}" \''BEGIN{FS=" "}{if($2 == id){print $2,$3}}'\' ids.ped