
How to extract 2+ character words from a string in Perl


I assume some sort of regex would be used to accomplish this?

I need each word to consist of 2 or more characters, start with a letter, and have its remaining characters be letters, digits, or underscores.

This is the code I currently have, although it isn't very close to my desired output:

while (my $line=<>) {
  # remove leading and trailing whitespace
  $line =~ s/^\s+|\s+$//g;
  $line = lc $line;
  @array = split / /, $line;
  foreach my $a (@array){
    $a =~ s/[\$#@~!&*()\[\];.,:?^ `\\\/]+//g;
    push(@list, "$a");
  }
}

A sample input would be:

#!/usr/bin/perl -w
use strict;
# This line will print a hello world line.
print "Hello world!\n";
exit 0;

And the desired output would be (alphabetical order):

bin
exit
hello
hello
line
perl
print
print
strict
this
use
usr
will
world

Solution

  • my @matches = $string =~ /\b([a-z][a-z0-9_]+)/ig;
    

    If case-insensitive matching should apply only to a subpattern, the modifier can be embedded in it:

    /... \b((?i)[a-z][a-z0-9_]+) .../
    

    (Alternatively, it can be turned off after the subpattern: (?i)pattern(?-i).)
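
    As a minimal sketch of the difference, the following compares the embedded modifier with /i on the whole pattern. The input string and the "ID: " label are made up for illustration; only the word subpattern comes from the regex above.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: the same label in two different cases.
    my $text = 'ID: Foo_1 id: Bar_2';

    # (?i) inside the capture group: "ID: " must match exactly as written,
    # while the captured word is matched case-insensitively.
    my @scoped = $text =~ /ID: \b((?i)[a-z][a-z0-9_]+)/g;

    # /i on the whole pattern: the label is matched case-insensitively too.
    my @global = $text =~ /ID: \b([a-z][a-z0-9_]+)/ig;

    print "scoped: @scoped\n";   # scoped: Foo_1
    print "global: @global\n";   # global: Foo_1 Bar_2
    ```

    An embedded modifier stays in effect only up to the end of the group that contains it. If the group need not capture, (?i:pattern) is a more compact way to write the same thing.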

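    The class [a-zA-Z0-9_] can be written as \w, a "word character", if that's indeed exactly what is needed. (Under Unicode rules \w also matches non-ASCII letters and digits; the /a modifier restricts it to ASCII.)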

    The above regex picks out words as required without first splitting the line on spaces, as the shown program does. It can be applied to the whole line (or to the whole text, for that matter), perhaps after stripping the various special characters as shown.
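
    For instance, here is a small sketch that applies the regex directly to one line of the question's sample input, with no split and no prior stripping. The variable names are just for illustration.

    ```perl
    use strict;
    use warnings;

    # One line from the sample input (single-quoted, so \n stays as two characters).
    my $line = 'print "Hello world!\n";';

    # In list context with /g, the match returns every captured word.
    my @words = map { lc } $line =~ /\b([a-z][a-z0-9_]+)/ig;

    print "$_\n" for @words;   # print, hello, world (one per line)
    ```

    The lone n after the backslash is skipped because the pattern requires at least two characters.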

    There is the question of some other cases: what about hyphens? Apostrophes? Tildes? Those aren't found in identifiers, and this appears to be intended for processing program text, but comments are included as well; what other legitimate characters may there be?
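
    To see what the regex does with such characters as it stands, here is a quick sketch on a made-up piece of comment text. Whether this behavior is acceptable depends on the actual requirements.

    ```perl
    use strict;
    use warnings;

    # Hypothetical comment text with an apostrophe and a hyphen.
    my $text = "don't use well-known names";

    my @words = $text =~ /\b([a-z][a-z0-9_]+)/ig;

    print "$_\n" for @words;   # don, use, well, known, names (one per line)
    ```

    The apostrophe and the hyphen act as word boundaries, so "don't" yields only "don" (the single "t" is too short) and "well-known" comes out as two words.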


    Note on splitting on whitespace

    The shown split / /, $line splits on exactly one space. Better is split /\s+/, $line, or better yet, split's special pattern, split ' ', $line: it splits on any run of consecutive whitespace and discards leading and trailing whitespace.
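
    The three forms can be compared on a made-up line with leading, trailing, and repeated whitespace. This is only a sketch; the input is an assumption chosen to expose the differences.

    ```perl
    use strict;
    use warnings;

    # Hypothetical line: two leading spaces, a run of three spaces, a tab, a trailing space.
    my $line = "  use   strict;\t# comment ";

    my @single = split / /,   $line;   # every single space separates; empty fields appear
    my @spaces = split /\s+/, $line;   # runs collapse, but one leading empty field remains
    my @awk    = split ' ',   $line;   # leading whitespace skipped, runs collapse

    printf "%d %d %d\n", scalar @single, scalar @spaces, scalar @awk;   # 7 5 4
    ```

    Only the split ' ' form returns just the four non-empty pieces: use, strict;, # and comment. In all three cases trailing empty fields are dropped by default.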


    The shown example is processed correctly, as desired, by the given regex alone:

    use strict;
    use warnings;
    use feature 'say';
    use Path::Tiny qw(path);  # convenience, to slurp the file
    
    my $fn = shift // die "Usage: $0 filename\n";
    
    my @matches = sort map { lc } 
        path($fn)->slurp =~ /\b([a-z][a-z0-9_]+)/ig; 
    
    say for @matches;
    

    I threw in sorting and lower-casing to match the sample code in the question, but all the processing is done by the shown regex on the file's content held in a string.

    The output is as desired (except that line and world come out twice here, which is correct).

    Note that lc can instead be applied to the string holding the file content, which is then processed with the regex; this is more efficient. In principle the two approaches are not equivalent, but in this case the results may well be the same:

    perl -MPath::Tiny -wE'$f = shift // die "Need filename\n"; 
        @m = sort lc(path($f)->slurp) =~ /\b([a-z]\w+)/ig; 
        say for @m'
    

    Here I actually used \w. Adjust it to the actual characters to match, if they differ.