Tags: bash, loops, large-data, large-files

Loop through a really large list


I need to go through a really large VCF file to find matching information (rows whose column values match given targets).

Here is what I have tried so far, but it is not working and is really problematic.

target_id=('id1' 'id2' 'id3' ...)

awk '!/#/' file_in | cut -f3,10-474|
for id in $target_id
do
    grep "target"
done

It only loops through the file looking for the first id in the target_id list (in bash, $target_id expands to just the first element of the array).

I'm wondering: is there a way to loop through the file looking for all the ids in the target_id list? I want to output the entire row (the 3rd and 10th-474th columns) if the 3rd column matches.


Solution

  • You may get the same behaviour as the for loop by using a single grep for all the target_id values at once, for example:

    grep -E "id1|id2|id3"
    

    This might also improve performance, as you don't have to fork a new grep instance for each target_id.

    You mentioned that file_in (the VCF file) is huge. As long as filesystem limits are not reached, you won't get into trouble: for example, ext2 and ext3 have a maximum file size of 2 TiB, while ext4 allows up to 16 TiB.

    However, you may run into the limit on the total size of command-line arguments (ARG_MAX) if the $target_id list is very large; a workaround is sketched right after the script below.

    Please find the resulting code below. (Note that | \ is used to write a very long command across multiple lines: the \ tells the shell that the command continues on the next line. Strictly speaking, the shell already continues after a trailing |, so the backslash is just a visual cue.) A variant that matches ids in the 3rd column only, rather than anywhere in the row, is sketched at the end.

    #!/bin/bash

    # Space-separated list of the ids to look for
    target_id="id1 id2 id3"

    # Drop VCF header lines, keep columns 3 and 10-474, then match any id;
    # tr turns "id1 id2 id3" into the alternation "id1|id2|id3"
    awk '!/#/' file_in | \
    cut -f3,10-474 | \
    grep -E "$(echo "$target_id" | tr ' ' '|')"