Find all occurrences of a string in a file and print their line numbers in Perl


I have a large file of 400,000 lines; each line contains many keywords separated by tabs.

I also have a file that contains a list of keywords to match. This file acts as a lookup table.

For each keyword in the lookup table, I need to find all of its occurrences in the large file and print the line number of each occurrence.

I have tried this:

#!/usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
my $filepath1 = <>;

print "Enter the file path that contains keywords :";
my $filepath2 = <>;

open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;

open OUT, ">", "SampleLineNum.txt";

while( my $line = <FILE1> )
{
    while( <FILE2> ) 
    {
        $linenum = $., last if(/$line/);
    }
    print OUT "$linenum ";
}

close FILE1;

This gives only the first occurrence of the keyword, but I need all occurrences, and the keyword should match exactly.

The problem I am facing with exact matching is this: suppose I have the keywords "hello" and "hello world".

If I search for "hello", the script also returns the line numbers of lines that contain "hello world". It should match only "hello" and give its line number.


Solution

  • Here is a solution that matches every occurrence of all keywords:

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    # Lexical filehandles are preferred; always check open for errors.
    open my $keywords,    '<', 'keywords.txt' or die "Can't open keywords: $!";
    open my $search_file, '<', 'search.txt'   or die "Can't open search file: $!";
    
    # Quote each keyword so it matches literally, then join them with '|'.
    my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
    # \b on both sides makes the alternation match whole words only.
    my $regex = qr|\b($keyword_or)\b|;
    
    while (<$search_file>)
    {
        while (/$regex/g)
        {
            print "$.: $1\n";
        }
    }
    

    keywords.txt:

    hello
    foo
    bar
    

    search.txt:

    plonk
    food is good
    this line doesn't match anything
    bar bar bar
    hello world
    lalalala
    hello everyone
    

    Output:

    4: bar
    4: bar
    4: bar
    5: hello
    7: hello
    

    Explanation:

    This creates a single regex that matches all of the keywords in the keywords file.

    <$keywords> - when this is used in list context, it returns a list of all lines of the file.

    map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line (This ensures that if you have a keyword like "foo.bar" it will treat the dot as a literal character, not a regex metacharacter).

    join '|', - join the resulting list into a single string, separated by pipe characters.

    my $regex = qr|\b($keyword_or)\b|; - create a regex that looks like this:

    /\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/

    This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.
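The effect of the quoting and the word boundaries can be checked in isolation. The following is a minimal sketch, not part of the original answer: `build_keyword_regex` is a hypothetical helper, the sample strings are made up, and `quotemeta` is used as the function form of `\Q...\E`. It also sorts keywords longest first, which is an added assumption: Perl tries alternatives left to right, so without the sort `foo` would win over `foo.bar`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): build a single
# whole-word alternation regex from a list of literal keywords.
sub build_keyword_regex {
    my @keywords = @_;

    # Longest keywords first, so "foo.bar" is tried before "foo".
    my @ordered = sort { length($b) <=> length($a) } @keywords;

    # quotemeta() escapes regex metacharacters, like \Q...\E does.
    my $alternation = join '|', map { quotemeta($_) } @ordered;

    return qr/\b($alternation)\b/;
}

my $regex = build_keyword_regex('foo', 'foo.bar');

# Made-up sample strings to show which ones match.
for my $text ('food', 'a foo here', 'fooxbar', 'foo.bar') {
    my @hits = $text =~ /$regex/g;    # list of captured keywords
    print "$text => ", (@hits ? join(', ', @hits) : '(no match)'), "\n";
}
```

Here `food` and `fooxbar` should not match: the first fails the trailing `\b`, and the second fails because the escaped dot only matches a literal dot.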

    I updated the solution to match each keyword on a given line and to only match complete words.
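Note that with word boundaries, the keyword "hello" still matches inside the field "hello world". If the requirement is that a whole tab-separated field must equal the keyword, one possible alternative is to split each line on tabs and look each field up in a hash. This is only a sketch under that assumption, not the answer's method: `find_exact_fields` is a hypothetical helper, and the sample data is held in memory so that it runs without any files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return one "line: keyword" entry for every
# tab-separated field that is exactly equal to a lookup keyword.
sub find_exact_fields {
    my ($lookup, $fh) = @_;    # hash ref of keywords, input filehandle
    my @found;
    while (my $line = <$fh>) {
        chomp $line;
        for my $field (split /\t/, $line) {
            # $. holds the current line number of the handle being read.
            push @found, "$.: $field" if exists $lookup->{$field};
        }
    }
    return @found;
}

# Made-up sample data; in practice both would come from files.
my %lookup = map { $_ => 1 } ('hello', 'foo');
my $data   = "hello world\tbar\nfood\thello\nfoo\thello\n";

open my $fh, '<', \$data or die "Can't open in-memory data: $!";
print "$_\n" for find_exact_fields(\%lookup, $fh);
close $fh;
```

Because each field costs one hash lookup, the large file is read only once regardless of how many keywords the lookup table holds.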