Search code examples
regexperlrex

Regex simple replace document from dictionary hash (Perl)


I need to find and replace keywords from a hash in a large documents as fast as possible. I tired the below two methods, one is faster by 320% but I am sure I am doing this the wrong way and sure there is a better way to do it.

The idea I want to replace only the keywords that exist in the dictionary hash and keep those that does not exist so I know it is not in the dictionary.

Both methods below scan twice to find and replace as I think. I am sure the regex like look ahead or behind can optimize it much faster.

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(:all);

my %dictionary = (
            pollack => "pollard",
            polynya => "polyoma",
            pomaces => "pomaded",
            pomades => "pomatum",
            practic => "praetor",
            prairie => "praised",
            praiser => "praises",
            prajnas => "praline",
            quakily => "quaking",
            qualify => "quality",
            quamash => "quangos",
            quantal => "quanted", 
            quantic => "quantum",
    );

my $content =qq{
        Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
        can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
        Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
        one kind of problem {qualify} {doesNotExist} end.
    };

# just duplicate content many times
$content .= $content;

cmpthese(100000, {
    replacer_1 => sub {my $text = replacer1($content)},
    replacer_2 => sub {my $text = replacer2($content)},
});

print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;

sub replacer1 {
    my ($content) = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub replacer2 {
    my ($content) = shift;
    my @names = $content =~ /\{(.+?)\}/g;
    foreach my $name (@names) {
        if (exists $dictionary{$name}) {
            $content =~ s/\{$name\}/\[$dictionary{$name}\]/;
        }
    }
    return $content;
}

Here is the benchmark result:

              Rate replacer_2 replacer_1
replacer_2  5565/s         --       -76%
replacer_1 23397/s       320%         --

Solution

  • I'd recommend using meaningful names for your benchmarking subroutines, it'll make the output and intent more clear.

    The following reproduces a bit of what Borodin and mob have tried out, and then combines them as well.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use feature 'state';
    
    use Benchmark qw(:all);
    
    # Data separated by paragraph mode.
    my %dictionary = split ' ', do {local $/ = ''; <DATA>};
    my $content = do {local $/; <DATA>};
    
    # Quadruple Content
    $content = $content x 4;
    
    cmpthese(100_000, {
        original        => sub { my $text = original($content) },
        build_list      => sub { my $text = build_list($content) },
        xor_regex       => sub { my $text = xor_regex($content) },
        list_and_xor    => sub { my $text = list_and_xor($content) },
    });
    
    exit;
    
    sub original {
        my $content = shift;
        $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
        return $content;
    }
    
    sub build_list {
        my $content = shift;
        state $list = join '|', map quotemeta, keys %dictionary;
        $content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
        return $content;
    }
    
    sub xor_regex {
        my $content = shift;
    
        state $with_brackets = {
            map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
        };
    
        $content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;
    
        return $content;
    }
    
    sub list_and_xor {
        my $content = shift;
    
        state $list = join '|', map quotemeta, keys %dictionary;
        state $with_brackets = {
            map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
        };
    
        $content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;
    
        return $content;
    }
    
    __DATA__
    pollack pollard
    polynya polyoma
    pomaces pomaded
    pomades pomatum
    practic praetor
    prairie praised
    praiser praises
    prajnas praline
    quakily quaking
    qualify quality
    quamash quangos
    quantal quanted 
    quantic quantum
    
    Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
    can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
    Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
    one kind of problem {qualify} {doesNotExist} end.
    

    Outputs:

                    Rate     original    xor_regex   build_list list_and_xor
    original     19120/s           --         -23%         -24%         -29%
    xor_regex    24938/s          30%           --          -1%          -8%
    build_list   25253/s          32%           1%           --          -7%
    list_and_xor 27027/s          41%           8%           7%           --
    

    My solutions make heavy use of state variables to avoid reinitializing static data structures. However, one could also use closures or our $var; $var ||= VAL.

    Addendum about enhancing the LHS of the regex

    Actually, editing the LHS to use an explicit list is about improving the regular expression. And this change showed a 30% improvement in speed.

    There isn't likely to be any magic solution to this. You have a list of values, that you're wanting to replace. It isn't like there is some mysterious way to simplify the language of this goal.

    You could perhaps use a code block in the LHS to Fail and skip if the word does not exist in the dictionary hash. However, the following shows that this is actually 36% slower than your original method:

    sub skip_fail {
        my $content = shift;
    
        $content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;
    
        return $content;
    }
    

    Outputs:

                    Rate   skip_fail    original   xor_regex build_list list_and_xor
    skip_fail     6769/s          --        -36%        -46%       -49%         -53%
    original     10562/s         56%          --        -16%       -21%         -27%
    xor_regex    12544/s         85%         19%          --        -6%         -14%
    build_list   13355/s         97%         26%          6%         --          -8%
    list_and_xor 14537/s        115%         38%         16%         9%           --