Tags: shell, file, file-comparison, data-comparison

Fastest way to find lines in file1 which contain any keyword from file2?


I have two files. The first has three hundred thousand records (see the File1 example below) and the second has one hundred thousand records (see the File2 example below). I am basically grepping file1 for each entry in file2 and retrieving every matching line from file1. I am doing this with a plain for loop:

for i in `cat file2.txt`; do cat file1 | grep -i -w $i; done > /var/tmp/file3.txt

Because the data set is so large, this takes 8+ hours to complete. I need your expertise: how can I do this more efficiently, so that it finishes in less than 2-3 hours?

Example entries

File1

server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |  
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |  
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |  
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |  
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash | 

File2

user1  
user2  
user3  

Solution

  • Give this a shot.

    Test Data:

    %_Host@User> head file1.txt file2.txt
    ==> file1.txt <==
    server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
    server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
    server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
    server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
    server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |
    
    ==> file2.txt <==
    user1
    user2
    user3
    #user4
    %_Host@User>
    

    Output:

        %_Host@User> ./2comp.pl file1.txt file2.txt   ; cat output_comp
        server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
        server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
        server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
        %_Host@User>
        %_Host@User>
    

    Script: Please give this one more try, and re-check the file order. file1 goes first and file2 second: ./2comp.pl file1.txt file2.txt.

    %_Host@User> cat 2comp.pl
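    #!/usr/bin/perl
    
    use strict ;
    use warnings ;
    
    # Usage: ./2comp.pl file1.txt file2.txt
    # file1.txt holds the records, file2.txt holds the keywords.
    if (@ARGV != 2) {
      print "Need 2 files!\n" ;
      exit 1 ;
    }
    
    my ($records, $keywords) = @ARGV ;
    my $output = "output_comp" ;
    my (%hash, %seen) ;
    
    # Load both files into a hash keyed by file name, then by line.
    for my $file (@ARGV) {
      open(my $in, '<', $file) or die "Cannot open $file: $!\n" ;
      while (my $line = <$in>) {
        chomp $line ;
        $line =~ s/ +$// ;                 # Trim trailing spaces.
        next if $line eq '' ;              # An empty keyword would match every record.
        next if $line =~ /^.+[()].+$/ ;    # Skip lines containing parentheses.
        $hash{$file}{$line} = 1 ;
      }
      close $in ;
    }
    
    # Append each record that contains a keyword to the output file, once.
    open(my $out, '>>', $output) or die "Cannot open $output: $!\n" ;
    foreach my $k1 (keys %{$hash{$keywords}}) {
      foreach my $k2 (keys %{$hash{$records}}) {
        if ($k2 =~ m/^.+?\Q$k1\E.+?$/i) {    # Case Insensitive matching.
          next if $seen{$k2} ;
          print $out "$k2\n" ;
          $seen{$k2} = 1 ;
        }
      }
    }
    close $out ;
    # End.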
    %_Host@User>
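
The script above still compares every keyword against every record. If the keywords are always user names, and the user name is always the second colon-separated field (both are assumptions based on the sample data, not something the question states), a single pass with a hash lookup may be enough. A minimal sketch with awk; the temporary directory and the output name file3.txt are only for illustration:

```shell
#!/bin/sh
# Sketch only: sample data copied from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > file1.txt <<'EOF'
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |
EOF

printf '%s\n' user1 user2 user3 > file2.txt

# While reading the first file (file2.txt): trim trailing whitespace and
# store each non-empty keyword, lower-cased, as a hash key.
# While reading the second file (file1.txt): print the record when its
# second field, lower-cased, is one of the stored keys.
awk -F: '
  NR == FNR { sub(/[ \t]+$/, ""); if ($0 != "") keys[tolower($0)] = 1; next }
  tolower($2) in keys
' file2.txt file1.txt > file3.txt

cat file3.txt
```

Each record costs one hash lookup, so the run time should grow with the size of the two files added together rather than multiplied. Note that the NR == FNR idiom assumes file2.txt is not empty.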
    

    Thanks, and good luck.
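
For comparison with the for loop in the question: grep can read all keywords from a file in one go with -f, so file1 is scanned once instead of once per keyword. The sketch below keeps the question's -i and -w flags and adds -F so the keywords are treated as fixed strings. The temporary directory and the output name file3.txt are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: sample data copied from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > file1.txt <<'EOF'
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |
EOF

printf '%s\n' user1 user2 user3 > file2.txt

# -F: treat each keyword as a fixed string, not a regular expression
# -f: read all keywords from file2.txt at once
# -i, -w: case-insensitive, whole-word match, as in the original loop
grep -i -w -F -f file2.txt file1.txt > file3.txt

cat file3.txt
```

Unlike the loop, this prints each matching line once, in file1 order, even if several keywords match it. Trailing spaces in file2 become part of the pattern with -F, and a blank line there can match more than intended, so it is worth cleaning the keyword file first.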