Tags: .net, c#-11.0, .net-7.0

Regex source generator with large regex causes OutOfMemoryException at dotnet build


I have a large regex consisting of a word list separated by |. The entire pattern is 1 million characters long:

[RegexGenerator(@"KnownItem1|KnownItem2|KnownItem3")]
private static partial Regex NamedEntities();

Building with dotnet build results in this warning, and no source is generated:

CSC : warning CS8785: Generator 'RegexGenerator' failed to generate source. It will not contribute to the output and compilation errors may occur as a result. Exception was of type 'OutOfMemoryException' with message 'Exception of type 'System.OutOfMemoryException' was thrown.'

The dotnet.exe process was using 5 GB of RAM when the error occurred. How can I get the build to succeed?

I searched for ways to increase the RAM available to dotnet build and to reduce RAM usage by not emitting debug symbols, but did not find a solution. This is also different from the runtime Regex OOM that has been asked about many times on StackOverflow: it is a compile-time error with the new Regex source generator. The same regex works in interpreted mode at runtime.


Solution

  • The .NET regex engine is currently not optimized for this use case (a very large alternation of literal words). There is discussion about optimizing for this case with Aho-Corasick.

    For now, the workaround is to run Aho-Corasick yourself and eliminate overlapping matches manually.

    For example, matching hellover against the regex hello|lover yields a single match, hello.

    Aho-Corasick, however, reports both hello and lover, so you have to keep track of the index and length of each match it returns and eliminate lover in order to mimic the regex behaviour.
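A minimal sketch of that workaround, under a few assumptions: the words are plain literals (no regex metacharacters, anchors, or options such as IgnoreCase), and the type and method names (WordListMatcher, FindAll, FindLikeRegex) are hypothetical, not part of .NET or any library. FindAll is a small hand-rolled Aho-Corasick scan that reports every occurrence, overlaps included. FindLikeRegex then keeps only what an ordered alternation would return: the leftmost match, with ties at the same start position going to the word listed first, and the scan resuming after each kept match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not a .NET API: Aho-Corasick plus regex-style filtering.
public sealed class WordListMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public readonly List<int> Outputs = new List<int>(); // indices of words ending here
        public Node Fail;                                     // longest proper suffix that is also a trie prefix
    }

    private readonly Node _root = new Node();
    private readonly string[] _words;

    public WordListMatcher(IEnumerable<string> words)
    {
        // Empty words are dropped; they would match at every position.
        _words = words.Where(w => w.Length > 0).ToArray();

        // 1. Build the trie.
        for (int i = 0; i < _words.Length; i++)
        {
            Node node = _root;
            foreach (char c in _words[i])
            {
                if (!node.Next.TryGetValue(c, out var child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.Outputs.Add(i);
        }

        // 2. Compute failure links breadth-first.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(edge.Key)) fail = fail.Fail;
                edge.Value.Fail = fail == null ? _root : fail.Next[edge.Key];
                // Words ending at the failure target also end here.
                edge.Value.Outputs.AddRange(edge.Value.Fail.Outputs);
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Every occurrence of every word, overlapping ones included.
    public List<(int Start, int Length, int Word)> FindAll(string text)
    {
        var hits = new List<(int Start, int Length, int Word)>();
        Node node = _root;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            while (node != _root && !node.Next.ContainsKey(c)) node = node.Fail;
            if (node.Next.TryGetValue(c, out var next)) node = next;
            foreach (int w in node.Outputs)
                hits.Add((i - _words[w].Length + 1, _words[w].Length, w));
        }
        return hits;
    }

    // Only the matches that word1|word2|... would report.
    public List<(int Start, int Length, int Word)> FindLikeRegex(string text)
    {
        var kept = new List<(int Start, int Length, int Word)>();
        int resumeAt = 0; // the regex continues searching after the previous match
        foreach (var hit in FindAll(text).OrderBy(h => h.Start).ThenBy(h => h.Word))
        {
            if (hit.Start < resumeAt) continue; // overlaps a match already kept
            kept.Add(hit);
            resumeAt = hit.Start + hit.Length;
        }
        return kept;
    }
}
```

With the words hello and lover and the input hellover, FindAll reports both words, while FindLikeRegex keeps only hello at index 0, which is what the regex hello|lover returns. The word list can be loaded from a file or embedded resource at startup, so nothing of that size has to pass through the source generator.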