Boosting the grep search using GNU parallel

I am using the following grep script to output all the unmatched patterns:

grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
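To make the data flow of that pipeline concrete, here is a minimal sketch on toy data. The three patterns, the single short string, and the temporary directory are made up for illustration; only the two grep invocations come from the command above, and GNU grep is assumed for `-f -`.

```shell
# Toy inputs (invented for illustration; real inputs are far larger)
demo=$(mktemp -d)
printf '%s\n' aaaabbbbcccc 0123456789ab ffffeeeedddd > "$demo/patterns.txt"
printf '%s\n' 00aaaabbbbcccc11ffffeeeedddd22 > "$demo/large_strings.txt"

# First grep prints each pattern occurrence found in the strings (-o),
# second grep reads those hits as patterns from stdin (-f -) and keeps
# only the lines of patterns.txt that match none of them (-v).
grep -oFf "$demo/patterns.txt" "$demo/large_strings.txt" \
  | grep -vFf - "$demo/patterns.txt" > "$demo/unmatched_patterns.txt"

cat "$demo/unmatched_patterns.txt"   # prints 0123456789ab
```

Only `0123456789ab` is reported, because it is the one pattern that never occurs in the toy string.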

The patterns file contains 12-character substrings (some instances are shown below):


The large_strings file contains extremely long strings of around 20-100 million characters each (a small piece of one string is shown below):


How can we speed up the above script (GNU parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block, but that does not let me pipe one grep command into the other.

Btw these are all hexadecimal strings and patterns.

The working code below is a little faster than the traditional grep:

rg -oFf patterns.txt large_strings.txt | rg -vFf - patterns.txt > unmatched_patterns.txt

grep took an hour to finish the pattern matching, while ripgrep took around 45 minutes.


  • If you do not need to use grep, try:

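    build_k_mers() {
        # $1 = k-mer length, $2 = job slot number (one set of output files per slot)
        k="$1"
        slot="$2"
        # Append every k-mer to a file chosen by its first 2 characters
        perl -ne 'for $n (0..(length $_)-'"$k"') {
           $prefix = substr($_,$n,2);
           $fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
           $fh = $fh{$prefix};
           print $fh substr($_,$n,'"$k"'),"\n"
        }'
    }
    export -f build_k_mers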
    rm -rf tmp
    mkdir tmp
    export LC_ALL=C
    # search strings must be sorted for comm
    parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &
    # make shorter lines: after every 32012 characters insert \n and
    # repeat the last 12 characters at the start of the next line
    # This makes it easier for --pipepart to find a newline
    # It will not change the kmers generated
    perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
    # Build 12-mers
    parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
    # -j10 and 20s may be adjusted depending on hardware
    parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
    # make sure the background split of patterns.txt (started with & above) has finished
    wait
    parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??

    I tested this with patterns.txt (9 GBytes, 725937231 lines) and large_strings.txt (19 GBytes, 184 lines); on my 64-core machine it completes in 3 hours.
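The idea behind the script above is a sorted set difference: list every k-mer that occurs in the strings, then use comm to keep the patterns that are not in that list. Below is a minimal single-threaded sketch of that idea on toy data. It is not the parallel version: it uses plain `awk` and `sort` instead of `perl`, `parsort` and `parallel`, skips the split by 2-character prefix, and uses made-up 4-character patterns (`k=4`) in a temporary directory.

```shell
export LC_ALL=C            # byte-wise ordering, so sort and comm agree
work=$(mktemp -d)
k=4                        # toy k-mer length (the question uses 12)
printf '%s\n' 00ff a1b2 beef dead > "$work/patterns.txt"
printf '%s\n' 12deadbeef34 a1b200 > "$work/large_strings.txt"

# Emit every k-character window of every line, then sort and de-duplicate.
awk -v k="$k" '{
    for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
}' "$work/large_strings.txt" | sort -u > "$work/kmers.txt"

# comm needs both inputs sorted; -23 keeps lines found only in the first file.
sort -u "$work/patterns.txt" > "$work/patterns.sorted"
comm -23 "$work/patterns.sorted" "$work/kmers.txt" > "$work/unmatched_patterns.txt"

cat "$work/unmatched_patterns.txt"   # prints 00ff
```

Because every window of the strings is enumerated, overlapping occurrences are found as well. The price is that the k-mer list is roughly k times the size of the input before de-duplication, which is why the full script splits the work by prefix and sorts in parallel.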