I have a large file that contains 400,000 lines; each line contains many keywords separated by tabs.
I also have a file that contains a list of keywords to be matched. This file acts as a lookup table.
For each keyword in the lookup table, I need to find all of its occurrences in the large file and print the line number of each occurrence.
I have tried this:
#!/usr/bin/perl
use strict;
use warnings;
my $linenum = 0;
print "Enter the file path of lookup table:";
my $filepath1 = <>;
print "Enter the file path that contains keywords :";
my $filepath2 = <>;
open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;
open OUT, ">", "SampleLineNum.txt";
while( my $line = <FILE1> )
{
    while( <FILE2> )
    {
        $linenum = $., last if(/$line/);
    }
    print OUT "$linenum ";
}
close FILE1;
This gives only the first occurrence of the keyword. But I need all occurrences, and the keyword should match exactly.
The problem I am facing with exact matching is this: suppose I have the keywords "hello" and "hello world".
If I search for "hello", the script also returns the line numbers of lines that contain "hello world". It should match only "hello" and give its line number.
Here is a solution that matches every occurrence of all keywords:
#!/usr/bin/perl
use strict;
use warnings;
# Lexical variables for filehandles are preferred; always error-check opens.
open my $keywords, '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt' or die "Can't open search file: $!";
# Escape each keyword and join them all into one alternation.
my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;
while (<$search_file>)
{
    # /g in a while loop finds every occurrence on the line.
    while (/$regex/g)
    {
        print "$.: $1\n";
    }
}
keywords.txt:
hello
foo
bar
search.txt:
plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone
Output:
4: bar
4: bar
4: bar
5: hello
7: hello
Explanation:
This creates a single regex that matches all of the keywords in the keywords file.
<$keywords> - when this is used in list context, it returns a list of all lines of the file.
map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line. (This ensures that if you have a keyword like "foo.bar", the dot is treated as a literal character, not a regex metacharacter.)
join '|', - joins the resulting list into a single string, separated by pipe characters.
my $regex = qr|\b($keyword_or)\b|; - creates a regex that looks like this: /\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/
This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.
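The effect of the escaping and the word boundaries can be tried without any input files. The following is only a sketch: build_keyword_regex and keyword_hits are hypothetical helper names, quotemeta is used as the function form of \Q...\E, and sorting the keywords longest-first is an addition of this sketch (not part of the solution above) so that a longer keyword such as "foo.bar" is tried before its prefix "foo".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one alternation regex from a list of keywords.
# (Hypothetical helper; the name is not from the solution above.)
sub build_keyword_regex {
    my @keywords = @_;
    # Longest keywords first, so "foo.bar" wins over its prefix "foo".
    my @ordered = sort { length($b) <=> length($a) } @keywords;
    # quotemeta escapes regex metacharacters, like \Q...\E does.
    my $alternation = join '|', map { quotemeta($_) } @ordered;
    return qr/\b($alternation)\b/;
}

# Return every whole-word keyword hit in a string, in order.
sub keyword_hits {
    my ($regex, $text) = @_;
    my @hits;
    while ($text =~ /$regex/g) {
        push @hits, $1;
    }
    return @hits;
}

my $regex = build_keyword_regex('foo', 'foo.bar');

for my $text ('food is good', 'foo and fooXbar', 'see foo.bar here') {
    my @hits = keyword_hits($regex, $text);
    print "$text => ", (@hits ? join(', ', @hits) : '(no match)'), "\n";
}
```

With these inputs, "food is good" produces no match, "foo and fooXbar" yields only foo (the dot in "foo.bar" is literal, so "fooXbar" is not a hit), and "see foo.bar here" yields foo.bar.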
I updated the solution to match each keyword on a given line and to only match complete words.
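Note that \b treats "hello" inside "hello world" as a whole word, so the regex above still reports lines whose field is "hello world". If a keyword must equal an entire tab-separated field, one alternative (a different technique from the regex above, sketched here under that assumption) is to split each line on tabs and look each field up in a hash. The helper name lines_by_keyword and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map each keyword to the line numbers where it appears as a complete
# tab-separated field. (Hypothetical helper, for illustration only.)
sub lines_by_keyword {
    my ($keywords, $lines) = @_;    # two array references
    my %found;
    $found{$_} = [] for @$keywords;
    my $line_number = 0;
    for my $line (@$lines) {
        $line_number++;
        my %seen;    # record each keyword at most once per line
        for my $field (split /\t/, $line) {
            next unless exists $found{$field};
            push @{ $found{$field} }, $line_number unless $seen{$field}++;
        }
    }
    return \%found;
}

my @keywords = ('hello', 'foo');
my @lines    = ("hello world\tfoo", "hello\tbar", "food\thello world");

my $found = lines_by_keyword(\@keywords, \@lines);
for my $keyword (@keywords) {
    print "$keyword: @{ $found->{$keyword} }\n";
}
```

Here "hello" is reported only for line 2 and "foo" only for line 1; the fields "hello world" and "food" are ignored. When the lines come from a file instead of an in-memory list, they would need a chomp before the split so that the last field does not carry the newline. A hash lookup per field also avoids building one very large alternation when the lookup table is long.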