I am trying to convert a set of regular expression from Adblock Plus rules into an optimized function I could call from C++.
I was expecting to be able to use a lexer generator such as Ragel to do this but when I try with a very small set of Regex the memory usage go very high > 30 GB and Ragel quit without error message and without producing the output file.
I included the toy grammar bellow, I am trying to understand if I am doing anything stupid that could be optimized to solve the issue.
#include <string.h>
namespace xab{
%%{
machine lexer;
WILDCARD = /[A-Za-z0-9;\/\?:@=&$_\.\+!\*'~#^,%:\-]/*;
SUBDOMAIN = /([A-Za-z]([A-Za-z0-9\-]*[A-Za-z0-9])?\.)+/;
SEPERATOR = /[:\/\?=&]/;
main :=
(WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD) |
(WILDCARD '.a3s?n=' WILDCARD '&zone_id=' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD ';adtech;' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adiframe|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adserv|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/affiliates.' WILDCARD '.aspx?' WILDCARD) |
(WILDCARD '/affiliates/' WILDCARD '/show_banner.' WILDCARD) |
(WILDCARD '/banner_js.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannerframe.' WILDCARD '?' WILDCARD) |
(WILDCARD '/banners.' WILDCARD '&iframe=' WILDCARD) |
(WILDCARD '/bannerview.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannery/' WILDCARD '?banner=' WILDCARD) |
(WILDCARD '/cdn-cgi/pe/bag?r[]=' WILDCARD 'cpalead.com' WILDCARD) |
(WILDCARD '/delivery/' WILDCARD '?advplaces=' WILDCARD) |
(WILDCARD '/eas?camp=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';ord=' WILDCARD) |
(WILDCARD '/ireel/ad' WILDCARD '.jpg' WILDCARD) |
(WILDCARD '/is.php?ipua_id=' WILDCARD '&search_id=' WILDCARD);
write data;
}%%
bool matchBlacklist(const char *data) {
const char *p = data;
const char *pe = data + strlen(data);
int cs;
//write init
%% write init;
// write exec
%% write exec;
if (cs >= lexer_first_final)
return true;
return false;
}
}
AFAIK, you're facing the "DFA space explosion".
DFA has to match all your rules in one pass over the string. To do that every state needs a transition 1) to the beginning of every rule and 2) to the middle of every interleaving rule.
Further, the WILDCARD
might be creating "nondeterministic behaviour" because, for example, in the WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD
rule the WILDCARD
will match &prvtof=
. This, and the sheer amount of options in the WILDCARD
might further explode the DFA.
In the Ragel 6.8 manual there are guidelines on how to simplify the DFA in the sections "2.5.5 Concatenation" and "4. Controlling nondeterminism".
To avoid the "DFA space explosion" you might want to kind of "deoptimize" the Ragel machine by using scanners, thus selectively switching from a "stateless" DFA to backtracking. And you might want to reduce nondeterminism by using the strong difference operator. And you might want to simplify the WILDCARD
, replacing it with any
.
action matched {return true;}
main := |*
'&prvtof=' (any* -- '&poru=') '&poru=' => matched;
'.a3s?n=' (any* -- '&zone_id=') '&zone_id=' => matched;
any;
*|