Phrase search in a text file


Given a phrase like "I am searching for a text" and a text file that contains a list of words and phrases, one per line.

I have to find out whether each contiguous combination of words from the phrase is present in the text file.

For example, I have to search for occurrences of "I", "I am", "I am searching", "I am searching for", "searching for", etc.

I would prefer to write this in Perl, and I need an optimal solution that runs fast.

Example text file:

I \n
am searching \n
Text \n
searching for \n 
searching for a \n
for searching       ---> my program should not match this 
etc

Solution

  • The code below prints all the sub_phrases that you want to match.

    use strict;
    use warnings;

    my $phrase = 'I am searching for a text';
    $\ = "\n";    # append a newline to every print

    # Record each word together with its start and end offsets in $phrase.
    my @words;
    while ( $phrase =~ /\b\w+\b/g ) {
        push @words, { word => $&, begin => $-[0], end => $+[0] };
    }

    my $num_words = @words;
    print 'there are ', $num_words, ' words';

    # $i is the index of the first word of the sub-phrase, $j of the last.
    print "Indices:";
    for my $i ( 0 .. $num_words - 1 ) {
        for my $j ( $i .. $num_words - 1 ) {
            my ( $start, $finish ) = ( $words[$i]{begin}, $words[$j]{end} );
            my $sub_phrase = substr $phrase, $start, $finish - $start;
            print "$i-$j: $sub_phrase";
        }
    }
    

    Some explanations:

    1. $\ is set to "\n" so that every 'print' ends with a newline.
    2. $phrase holds the sample phrase from your question.
    3. @words is an array of references to records.
    4. Each record is a hash holding the word itself and the offsets of the beginning and the end of the word within $phrase.
    5. The while loop iterates over the matches of a regular expression that looks for a word boundary, one or more word characters, and another word boundary.
    6. @- and @+ are special arrays holding the start and end offsets of the last successful match; $-[0] and $+[0] are the offsets of the whole match.
    7. $& is a special variable holding the text matched by the last successful match.
    8. The nested loop then picks the first word with the outer variable $i and the last word with the inner variable $j. That covers every contiguous run of words.
    9. $sub_phrase is the text from the beginning of the first word to the end of the last word.
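
    To make the match variables concrete, here is a small standalone sketch (the shorter phrase and the @spans array are only for illustration) that records each word with its start and end offsets:

```perl
use strict;
use warnings;

my $phrase = 'I am searching';    # shortened sample, for illustration only
my @spans;
while ( $phrase =~ /\b\w+\b/g ) {
    # $& is the matched text; $-[0] and $+[0] are its start and end offsets.
    push @spans, sprintf '%s:%d-%d', $&, $-[0], $+[0];
}
print join( ', ', @spans ), "\n";    # prints "I:0-1, am:2-4, searching:5-14"
```

    The end offset is one past the last character of the word, which is why the length passed to substr is end minus begin.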

    To complete your exercise, save all the sub-phrases into an array (instead of 'print', do a 'push' onto an array such as @permutations). Then iterate through your text file and, for each line, try to match it against each saved sub-phrase.
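
    One possible way to finish the exercise is sketched below. It rests on assumptions that are not in the question: each line of the file is compared as a whole, the comparison is case-sensitive, and runs of whitespace are treated as a single space. The names sub_phrases and matching_lines are made up for this sketch, and it stores the sub-phrases in a hash instead of an array so that each line costs one lookup rather than a scan over every sub-phrase. An in-memory filehandle stands in for the real text file.

```perl
use strict;
use warnings;

# Return every contiguous run of words in $phrase, joined by single spaces.
sub sub_phrases {
    my ($phrase) = @_;
    my @words = $phrase =~ /\b\w+\b/g;
    my @result;
    for my $i ( 0 .. $#words ) {
        for my $j ( $i .. $#words ) {
            push @result, join ' ', @words[ $i .. $j ];
        }
    }
    return @result;
}

# Return the lines read from $fh that are exactly one of the sub-phrases.
sub matching_lines {
    my ( $phrase, $fh ) = @_;
    my %wanted = map { $_ => 1 } sub_phrases($phrase);
    my @found;
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;    # trim the newline and surrounding spaces
        $line =~ s/\s+/ /g;         # collapse runs of whitespace
        push @found, $line if $wanted{$line};
    }
    return @found;
}

# Stand-in for the text file: an in-memory filehandle with sample lines.
my $sample = "I\nam searching\nText\nsearching for\nsearching for a\nfor searching\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n" for matching_lines( 'I am searching for a text', $fh );
close $fh;
```

    With the sample lines above, this prints "I", "am searching", "searching for" and "searching for a". "for searching" is skipped because the words are out of order, and "Text" is skipped only because the comparison is case-sensitive; lowercase both sides with lc if that is not what you want.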