I need to go through a really large VCF file to find matching rows according to column values.
Here is what I have tried so far, but it is not working and is really problematic.
target_id=('id1' 'id2' 'id3' ...)
awk '!/#/' file_in | cut -f3,10-474|
for id in $target_id
do
grep "target"
done
It only loops through the file looking for the first id in the target_id list.
Is there a way to loop through the file looking for all of the ids in target_id? I want to output the row (the 3rd and 10th-474th columns) whenever the 3rd column matches one of the ids.
You can get the same behaviour as the for loop by running a single grep that searches for all of the target_id values at once, for example:
egrep "id1|id2|id3"
(egrep is the older spelling of grep -E; both treat | as alternation.)
This can also improve performance, since you no longer fork a new grep process for each target_id.
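If you keep the ids in a bash array, as in your original attempt, here is a minimal sketch of how you could join the elements into one pattern (the ids themselves are placeholders from your question):
#!/bin/bash
target_id=('id1' 'id2' 'id3')
# Join the array elements with | by setting IFS in a subshell:
# "${target_id[*]}" expands with the first character of IFS
# between the elements.
pattern=$(IFS='|'; printf '%s' "${target_id[*]}")
awk '!/#/' file_in | cut -f3,10-474 | egrep "$pattern"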
You mentioned that file_in (the VCF file) is huge. As long as you stay within the filesystem's limits, you won't get into trouble: ext2 and ext3 have a maximum file size of 2 TiB, for example, while ext4 allows files up to 16 TiB.
You may run into the kernel's limit on the total size of command-line arguments (ARG_MAX), however, if the $target_id list is too big.
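If the list does grow that large, you can put the ids in a file and hand it to grep with its -f option instead of building the pattern on the command line. A sketch, assuming one id per line in a file called ids.txt (the file name is just an example):
# -f ids.txt reads one pattern per line from the file;
# -F treats them as fixed strings rather than regular expressions.
awk '!/#/' file_in | cut -f3,10-474 | grep -F -f ids.txt
This keeps the command line short no matter how many ids you have.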
Please find the resulting code below. (Note that the trailing \ is used to write a very long command over multiple lines: it tells the shell that the command continues on the next line. After a trailing | the continuation is implicit anyway, so the \ there is purely for readability.)
#!/bin/bash

# IDs to look for in column 3
target_id="id1 id2 id3"

# Drop the VCF header lines, keep columns 3 and 10-474,
# then match all of the ids in a single grep pass.
awk '!/#/' file_in | \
cut -f3,10-474 | \
egrep "$(echo $target_id | tr ' ' '|')"