Tags: bash, perl, sed, parallel-processing, fasta

Bash: how to optimize/parallelize a search through two large files to replace strings?


I'm trying to figure out a way to speed up a pattern search-and-replace between two large text files (>10 MB each). File1 has two columns, with unique names in each row. File2 has one column containing one of the shared names from File1, in no particular order, with some text underneath that spans a variable number of lines. They look something like this:

File1:

uniquename1 sharedname1
uniquename2 sharedname2
...

File2:

>sharedname45
dklajfwiffwf
flkewjfjfw
>sharedname196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf

My goal is to use File1 to replace the sharedname variables with their corresponding uniquename, as follows:

New File2:

>uniquename45
dklajfwiffwf
flkewjfjfw
>uniquename196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf

This is what I've tried so far:

# one full sed pass over File2 for every line of File1
while read -r uniquenames sharednames; do
    sed -i "s/$sharednames/$uniquenames/g" "$File2"
done < "$File1"

It works, but it's ridiculously slow: the loop invokes sed once per line of File1, so File2 gets rewritten in full for every name pair. CPU usage is the rate-limiting step, so I tried to parallelize the modification across the 8 cores at my disposal, but couldn't get it to work. I also tried splitting File1 and File2 into smaller chunks and running the batches simultaneously, but I couldn't get that to work either. How would you implement this in parallel? Or do you see a different way of doing it?

Any suggestions would be welcomed.
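For reference, the chunking idea I was attempting looks roughly like this with GNU parallel (a sketch only, assuming GNU parallel is installed; replace-cmd stands in for whatever filter actually does the substitution on a chunk):

# Sketch: split File2 into ~10 MB chunks at '>' record boundaries and
# run a substitution filter on each chunk across 8 cores; -k keeps the
# chunks in their original order when the output is reassembled.
parallel -j 8 -k --pipepart --block 10M --recstart '>' -a File2 replace-cmd > NewFile2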

UPDATE 1

Fantastic! Great answers from @Cyrus and @JJoao, and helpful suggestions from the other commenters. I implemented both, on @JJoao's recommendation to compare the compute times, and it's an improvement (~3 hours instead of ~5). However, since this is just text file manipulation, I don't see why it should take more than a couple of minutes, so I'm still tinkering with the suggestions to make better use of the available CPUs and speed it up further.

UPDATE 2: correction to UPDATE 1. I had included the modifications in my script and run it as a whole, but another chunk of my code was slowing everything down. Instead, I ran the suggested bits of code individually on the target intermediary files. Here's what I saw:

Time for @Cyrus' sed to complete
real    70m47.484s
user    70m43.304s
sys     0m1.092s

Time for @JJoao's Perl script to complete
real    0m1.769s
user    0m0.572s
sys     0m0.244s

Looks like I'll be using the Perl script. Thanks for helping, everyone!

UPDATE 3: Here's the time taken by @Cyrus' improved sed command:

time sed -f <(sed -E 's|(.*) (.*)|s/^>\2$/>\1/|' File1 | tr "\n" ";") File2
real    21m43.555s
user    21m41.780s
sys     0m1.140s
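For clarity, the inner sed rewrites each File1 row into a substitution command, and tr joins them with semicolons, so the script handed to the outer sed looks like this (illustrated with the example names above; the ^ and $ anchors keep sharedname1 from also matching sharedname196):

s/^>sharedname1$/>uniquename1/;s/^>sharedname2$/>uniquename2/;...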

Solution

#!/usr/bin/perl
use strict;
use warnings;

my $file1 = shift;
my %dic;

open(my $F1, '<', $file1) or die("can't find replacement file\n");
while (<$F1>) {                      # slurp File1 into the dictionary
  if (/(\S+)\s+(\S+)/) { $dic{$2} = $1 }
}

while (<>) {                         # for all File2 lines
  s/(?<=>)(.*)/ $dic{$1} || $1 /e;   # replace ">id" with ">dic{id}"
  print;
}
    

I prefer @Cyrus' solution, but if you need to do this often you can use the Perl script above (make it executable with chmod +x and install it) as a dict-replacement command.

Usage: dict-replacement File1 File* > output
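For example, with the sample files above (assuming the script was saved under the name dict-replacement):

chmod +x dict-replacement
./dict-replacement File1 File2 > NewFile2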

It would be nice if you could tell us the times of the various solutions...