I need to find and replace keywords from a hash in a large documents as fast as possible. I tired the below two methods, one is faster by 320% but I am sure I am doing this the wrong way and sure there is a better way to do it.
The idea I want to replace only the keywords that exist in the dictionary hash and keep those that does not exist so I know it is not in the dictionary.
Both methods below scan twice to find and replace as I think. I am sure the regex like look ahead or behind can optimize it much faster.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my %dictionary = (
pollack => "pollard",
polynya => "polyoma",
pomaces => "pomaded",
pomades => "pomatum",
practic => "praetor",
prairie => "praised",
praiser => "praises",
prajnas => "praline",
quakily => "quaking",
qualify => "quality",
quamash => "quangos",
quantal => "quanted",
quantic => "quantum",
);
my $content =qq{
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
};
# just duplicate content many times
$content .= $content;
cmpthese(100000, {
replacer_1 => sub {my $text = replacer1($content)},
replacer_2 => sub {my $text = replacer2($content)},
});
print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;
sub replacer1 {
my ($content) = shift;
$content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
return $content;
}
sub replacer2 {
my ($content) = shift;
my @names = $content =~ /\{(.+?)\}/g;
foreach my $name (@names) {
if (exists $dictionary{$name}) {
$content =~ s/\{$name\}/\[$dictionary{$name}\]/;
}
}
return $content;
}
Here is the benchmark result:
Rate replacer_2 replacer_1
replacer_2 5565/s -- -76%
replacer_1 23397/s 320% --
I'd recommend using meaningful names for your benchmarking subroutines, it'll make the output and intent more clear.
The following reproduces a bit of what Borodin and mob have tried out, and then combines them as well.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'state';
use Benchmark qw(:all);
# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};
# Quadruple Content
$content = $content x 4;
cmpthese(100_000, {
original => sub { my $text = original($content) },
build_list => sub { my $text = build_list($content) },
xor_regex => sub { my $text = xor_regex($content) },
list_and_xor => sub { my $text = list_and_xor($content) },
});
exit;
sub original {
my $content = shift;
$content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
return $content;
}
sub build_list {
my $content = shift;
state $list = join '|', map quotemeta, keys %dictionary;
$content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
return $content;
}
sub xor_regex {
my $content = shift;
state $with_brackets = {
map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
};
$content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;
return $content;
}
sub list_and_xor {
my $content = shift;
state $list = join '|', map quotemeta, keys %dictionary;
state $with_brackets = {
map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
};
$content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;
return $content;
}
__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted
quantic quantum
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
Outputs:
Rate original xor_regex build_list list_and_xor
original 19120/s -- -23% -24% -29%
xor_regex 24938/s 30% -- -1% -8%
build_list 25253/s 32% 1% -- -7%
list_and_xor 27027/s 41% 8% 7% --
My solutions make heavy use of state
variables to avoid reinitializing static data structures. However, one could also use closures or our $var; $var ||= VAL
.
Actually, editing the LHS to use an explicit list is about improving the regular expression. And this change showed a 30% improvement in speed.
There isn't likely to be any magic solution to this. You have a list of values, that you're wanting to replace. It isn't like there is some mysterious way to simplify the language of this goal.
You could perhaps use a code block in the LHS to Fail and skip if the word does not exist in the dictionary hash. However, the following shows that this is actually 36% slower than your original method:
sub skip_fail {
my $content = shift;
$content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;
return $content;
}
Outputs:
Rate skip_fail original xor_regex build_list list_and_xor
skip_fail 6769/s -- -36% -46% -49% -53%
original 10562/s 56% -- -16% -21% -27%
xor_regex 12544/s 85% 19% -- -6% -14%
build_list 13355/s 97% 26% 6% -- -8%
list_and_xor 14537/s 115% 38% 16% 9% --