I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line, if we don't want to spend extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Expected output:
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried, and it seems to be taking several hours to run:
fgrep -F -f file1.txt file2.txt > file.matched
I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.
A small piece of Perl code solved the problem. This is the approach taken:

- store the contents of file1.txt in a hash
- read file2.txt line by line, parse out the second field
- check whether the field is in the hash; if so, print the line

Here is the code:
#!/usr/bin/perl -w
use strict;
if (scalar(@ARGV) != 2) {
printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
exit(2);
}
my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);
open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;
# store contents of small file in a hash
while (<$small_fp>) {
chomp;
$small_hash{$_} = undef;
}
close($small_fp);
# loop through big file and find matches
while (<$big_fp>) {
# no need for chomp
$field = (split(/\|/, $_))[1];
if (defined($field) && exists($small_hash{$field})) {
printf("%s", $_);
}
}
close($big_fp);
exit(0);
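For reference, the invocation is simply the following (the output file name is just an example):
time ./fgrep.pl file1.txt file2.txt > perl.out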
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 12 seconds, producing 126K matches. Here is the time output for the same:
real 0m11.694s
user 0m11.507s
sys 0m0.174s
I ran @Inian's awk code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it is looping 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1
real 55m5.539s
user 54m53.080s
sys 0m5.095s
Using @Inian's other awk solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out
real 0m39.966s
user 0m37.916s
sys 0m0.743s
time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out
real 0m41.057s
user 0m38.475s
sys 0m0.904s
awk is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:
real 895m14.862s
user 806m59.219s
sys 1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
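A typical invocation of this approach would look roughly like the following; the block size shown is illustrative, not the exact command I ran:
parallel --pipepart --block 10M -a file2.txt fgrep -F -f file1.txt > parallel.out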
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, at which point it had produced about 100K matches. I wish fgrep had an option to force the contents of the -f file to be kept in a hash, just like what the Perl code did.
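One fgrep variant I did not benchmark: the C-locale trick used with awk above is often reported to help grep -F as well, so it may be worth trying:
time LC_ALL=C fgrep -f file1.txt file2.txt > fgrep.out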
I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
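For completeness, here is an untested sketch of what that would look like (note that join prints the join field first, so the column order would differ from the expected output shown above):
join -t '|' -1 1 -2 2 <(sort file1.txt) <(sort -t '|' -k 2,2 file2.txt) > join.out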
Thanks everyone for your attention and responses.