Tags: linux, bash, perl, awk, grep

Fastest way to find lines of a file from another larger file in Bash


I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines of file2.txt where f2 matches any f1 value from file1.txt (or where the value matches anywhere on the line, if we don't want to spend extra time splitting file2.txt into fields).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Output expected:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried, and it seems to take several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder if there is a better and faster way to do this with common Unix commands or a small script.


Solution

  • A small piece of Perl code solved the problem. This is the approach taken:

    • store the lines of file1.txt in a hash
    • read file2.txt line by line, parse and extract the second field
    • check if the extracted field is in the hash; if so, print the line

    Here is the code:

    #!/usr/bin/perl -w
    
    use strict;
    if (scalar(@ARGV) != 2) {
      printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
      exit(2);
    }
    
    my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
    my ($small_fp, $big_fp, %small_hash, $field);
    
    open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
    open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;
    
    # store contents of small file in a hash
    while (<$small_fp>) {
      chomp;
      $small_hash{$_} = undef;
    }
    close($small_fp);
    
    # loop through big file and find matches
    while (<$big_fp>) {
      # no need for chomp
      $field = (split(/\|/, $_))[1];
      if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
      }
    }
    
    close($big_fp);
    exit(0);
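
    For reference, the script is invoked as the usage message describes, with the small file first and the output redirected to a file (perl.out is just an example name):

    perl fgrep.pl file1.txt file2.txt > perl.out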
    

    I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output:

    real    0m11.694s
    user    0m11.507s
    sys 0m0.174s
    

    I ran @Inian's awk code:

    awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
    

    It was way slower than the Perl solution, since it loops over the 14K patterns for every line of file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt, having produced 40K matched lines. Here is how long it took:

    awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
     input record number 675280, file file2.txt
     source line number 1
    
    real    55m5.539s
    user    54m53.080s
    sys 0m5.095s
    

    Using @Inian's other awk solution, which eliminates the looping issue:

    time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out
    
    real    0m39.966s
    user    0m37.916s
    sys 0m0.743s
    
    time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out
    
    real    0m41.057s
    user    0m38.475s
    sys 0m0.904s
    

    awk is very impressive here, given that we didn't have to write an entire program to do it.

    I ran @oliv's Python code as well. It took about 15 hours to complete the job and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

    real    895m14.862s
    user    806m59.219s
    sys 1m12.147s
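
    To make that concrete, the regex approach boils down to something like the following shell sketch (an illustration of the idea only, not @oliv's actual Python code, and it assumes the values in file1.txt contain no regex metacharacters):

    # Build one giant alternation from the 14K patterns and scan file2.txt with it.
    # Every line is then tested against all alternatives, instead of one hash lookup.
    pattern=$(paste -s -d '|' file1.txt)   # foo1|foo2|...|bar2
    grep -E "$pattern" file2.txt > regex.out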
    

    I tried to follow the suggestion to use GNU parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
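
    For context, the kind of parallel + fgrep combination I attempted looks roughly like this (a sketch of the general shape, not the exact command; --block controls the chunk size I kept reducing):

    # Split file2.txt into chunks and run fgrep over each chunk in parallel.
    parallel --pipepart -a file2.txt --block 10M fgrep -f file1.txt > parallel.out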


    What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to keep the contents of its -f file in a hash, just like the Perl code does.

    I didn't try the join approach, as I didn't want the additional overhead of sorting both files first. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
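
    For completeness, a join-based pipeline would have looked roughly like the following sketch (untested here); both inputs have to be sorted on the join field first, which is exactly the sorting overhead mentioned above:

    # Sort file1.txt on its single field and file2.txt on field 2, join on that field,
    # and print file2.txt's original three fields. LC_ALL=C keeps sort and join consistent.
    LC_ALL=C join -t '|' -1 1 -2 2 -o '2.1,2.2,2.3' \
        <(LC_ALL=C sort file1.txt) \
        <(LC_ALL=C sort -t '|' -k 2,2 file2.txt) > join.out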

    Thanks everyone for your attention and responses.