
How to replace multiple patterns in the same file, based on the line's first word?


I have a list of phrases in a single file ( "phrases" ), each being on its own line.

I also have another file, which contains a list of words, each on a line ("words").

I wish to append an asterisk at the end of every phrase in "phrases", which begins with a word listed in "words".

For example:

File "phrases":

gone are the days
hello kitty
five and a half
these apples are green

File "words":

five
gone

Expected result in "phrases" after the operation:

gone are the days *
hello kitty
five and a half *
these apples are green

What I have done so far is this:

parallel -j0 -a words -q perl -i -ne 'print "$1 *" if /^({}\s.*)$/' phrases

But this truncates the file and sometimes (not always) gives me this error:

Can't remove phrases: No such file or directory, skipping file.

Because the edits are made concurrently, my intention is for each job to search and replace ONLY the lines that start with its word, leaving the other lines intact. Otherwise the parallel executions overwrite each other.

I am open to other concurrent methods as well.


Solution

  • This is not a good fit for parallel processing, because by far the most expensive operation here - usually - is reading from disk. The CPU is much, much faster.

    Your problem is not CPU intensive, so you won't gain much from running in parallel. Worse - as you've found - you introduce a race condition that can lead to file clobbering.

    Practically speaking, disk IO is done in chunks - multiple kilobytes at a time - which are fetched into cache and then fed to the OS in such a way that you can pretend reads work byte-by-byte.

    If you read a file sequentially, predictive fetch allows the OS to be even more efficient about it, and just pull the whole file into cache as fast as possible, massively speeding up the processing.

    Trying to parallelise and interleave this process at best has no effect, and can make things worse.

    So with that in mind, you'd be better off not parallelising at all, and instead doing a single pass:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    open ( my $words_fh, '<', 'words' ) or die $!;
    my $words = join '|', map { s/\n//r } <$words_fh>;
       $words = qr/^(?:$words)\b/;   # anchored: the phrase must *begin* with a word
    close ( $words_fh );
    
    # Diagnostic goes to STDERR so it doesn't end up in redirected output.
    print STDERR "Using match regex of: ", $words, "\n";
    
    open ( my $phrases_fh, '<', 'phrases' ) or die $!;
    while ( <$phrases_fh> ) {
      if (m/$words/) {
          s/$/ */;
      }
      print;
    }
    

    Redirect output to the desired location.
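    For example, write to a temporary file and move it into place on success (a sketch using the question's file names; the inlined one-liner is a condensed equivalent of the script above, and the `.tmp` name is arbitrary):

    ```shell
    # Recreate the question's inputs (for illustration).
    printf 'gone are the days\nhello kitty\nfive and a half\nthese apples are green\n' > phrases
    printf 'five\ngone\n' > words

    # Build one alternation from "words", mark matching lines, write elsewhere.
    perl -e '
      open my $w, "<", "words" or die $!;
      my $re = join "|", map { s/\n//r } <$w>;
      open my $p, "<", "phrases" or die $!;
      while (<$p>) { s/$/ */ if /^(?:$re)\b/; print }
    ' > phrases.tmp && mv phrases.tmp phrases

    cat phrases
    ```

    Redirecting straight onto `phrases` (`> phrases`) would truncate the input before perl reads it, which is why the temporary file is needed.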

    The most expensive bit is the reading of files - this does it once and once only. Invoking the regex engine repeatedly on the same line for each search term would also be expensive, because you'd be doing it N * M times, where N is the number of words and M is the number of lines.

    So instead we compile a single regex and match each line against it once, using the zero-width \b word-boundary marker (so it won't substring-match).
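    To see what \b buys you, compare a match with and without a word boundary after the alternation (a small sketch; `five|gone` stands in for the compiled word list):

    ```shell
    # Without \b, /^(?:five|gone)/ would also match "fiver"; the boundary
    # rejects it because "e" and "r" are both word characters.
    perl -e 'print "fiver and a half" =~ /^(?:five|gone)\b/ ? "match\n" : "no match\n"'   # prints "no match"
    perl -e 'print "five and a half"  =~ /^(?:five|gone)\b/ ? "match\n" : "no match\n"'   # prints "match"
    ```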

    Note - we don't quotemeta the contents of words - that may be a bug or a feature: it means you could add regex patterns into the mix (and a malformed one might break when we compile our regex).

    If you want to ensure it's 'literal', then:

    my $words = join '|', map { quotemeta } map { s/\n//r } <$words_fh>;
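    quotemeta escapes every regex metacharacter, so a words entry like `c++` is matched literally rather than treated as a pattern (a quick check; `c++` is just an illustrative input):

    ```shell
    # quotemeta backslash-escapes non-word characters.
    perl -e 'print quotemeta("c++"), "\n"'   # prints "c\+\+"
    ```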