Tags: c#, regex, string, delimiter, multiline

How to create a parameterized regex (in C#) that matches strings delimited by a custom multi-character delimiter?


I want to find strings in a text. The text can contain multiple lines. The strings are delimited by a custom delimiter, which should be a parameter. There can be multiple strings in the text, even on one line. For example, if the delimiter is three double quotation marks ("""), then in this text:

lorem ipsum """findthis""" "but not this" 'nor this' """anotherstringtofind"""

""blabla"" """yet another""""""text to find"""

It should find: findthis, anotherstringtofind, yet another, text to find. (Note that the delimiters are not part of the matched strings, although I can remove them in C# if needed.)

I can do something similar, but only for single-character delimiters, with the regex "[{0}](([^{0}])*)[{0}]", like this:

public static MatchCollection FindString(this string input, char delimiter, RegexOptions regexOptions = RegexOptions.Multiline)
{
    var regexString = string.Format("[{0}](([^{0}])*)[{0}]", delimiter);
    var rx = new Regex(regexString, regexOptions);

    MatchCollection matches = rx.Matches(input);

    return matches;
}

I guess the solution would use look-ahead operators, but I could not figure out how to combine them with something that has an effect similar to [^] for single characters. Is it even possible to "negate" a whole sequence of characters (so that it is not included in the matches)?

I think this question is similar, but I'm not familiar with Python.

Some clarification: I expect each delimiter pair to be used exactly once. So, for example, this test should pass:

            var inputText = "??abc?? ??def?? ??xyz??";

            var matches = inputText.FindString("??", RegexOptions.Singleline);

            Assert.Equal(3, matches.Count);

Is it possible to solve this in C# using regex? Thank you in advance!


Solution

  • You can use a lazy quantifier instead of a negated character class. In your example with """, this leads to a regex like """(.*?)"""

    Also, note that your current attempt incorrectly puts the delimiter inside a character class: ["""] is equivalent to ["], which in turn is equivalent to a plain ". Use the delimiter as is, without any additional wrapper.

    But don't forget to escape the delimiter before using it in a regex. For example, a delimiter like [] has to appear in the pattern as \[\].

    Your method would look like this:

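    public static MatchCollection FindString(this string input, string delimiter, RegexOptions regexOptions = RegexOptions.Multiline)
    {
        // Escape the delimiter so that characters such as ? or [ are matched literally.
        string pattern = string.Format("{0}(.*?){0}", Regex.Escape(delimiter));
        // Pass RegexOptions.Singleline if a string may span several lines:
        // by default, . does not match \n.
        var rx = new Regex(pattern, regexOptions);
        // Group 1 of each match holds the text between the delimiters.
        return rx.Matches(input);
    }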
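
To see how the lazy-quantifier approach behaves on the sample text from the question, here is a minimal, self-contained sketch. The class name DelimitedStringFinder and the helper FindContents are assumptions made for illustration and are not part of the original code. The helper reads capture group 1, so the delimiters themselves are left out of the results.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical holder class; the name is an assumption for this sketch.
public static class DelimitedStringFinder
{
    // Same idea as the method above: escaped delimiter, lazy .*?, escaped delimiter.
    public static MatchCollection FindString(this string input, string delimiter,
        RegexOptions regexOptions = RegexOptions.Multiline)
    {
        string escaped = Regex.Escape(delimiter);
        return Regex.Matches(input, escaped + "(.*?)" + escaped, regexOptions);
    }

    // Hypothetical helper: returns only the text between the delimiters (group 1).
    public static string[] FindContents(this string input, string delimiter,
        RegexOptions regexOptions = RegexOptions.Multiline)
    {
        return input.FindString(delimiter, regexOptions)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        const string Q3 = "\"\"\""; // the three-double-quote delimiter

        // The two-line sample text from the question.
        string text =
            "lorem ipsum " + Q3 + "findthis" + Q3 + " \"but not this\" 'nor this' "
            + Q3 + "anotherstringtofind" + Q3 + "\n"
            + "\"\"blabla\"\" " + Q3 + "yet another" + Q3 + Q3 + "text to find" + Q3;

        // Singleline lets . match \n, so a delimited string may also span lines.
        foreach (string s in text.FindContents(Q3, RegexOptions.Singleline))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // findthis
        // anotherstringtofind
        // yet another
        // text to find
    }
}
```

Because each match consumes both of its delimiters, every delimiter pair is used exactly once, which is what the ??abc?? ??def?? ??xyz?? test in the question expects.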
    

    Is it even possible to "negate" a whole sequence of characters

    Yes, it is possible: a construct like (?:(?!foo).)+ matches a run of characters that does not contain foo. For your example, the regex would be """(?:(?!""").)*""". However, it would typically perform worse than the simple lazy quantifier.
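
For comparison, the negated-sequence pattern can be parameterized in the same way. The following is only a sketch: the class name NegatedSequenceDemo and the method FindBetween are hypothetical names, and a capture group is added around the negated sequence so that the content can be read without the delimiters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are assumptions for this sketch.
public static class NegatedSequenceDemo
{
    public static string[] FindBetween(string input, string delimiter)
    {
        string d = Regex.Escape(delimiter);

        // (?:(?!d).)* consumes one character at a time, but only while
        // the delimiter does not start at the current position.
        string pattern = d + "((?:(?!" + d + ").)*)" + d;

        return Regex.Matches(input, pattern, RegexOptions.Singleline)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string[] found = FindBetween("??abc?? ??def?? ??xyz??", "??");
        Console.WriteLine(string.Join(", ", found)); // prints "abc, def, xyz"
    }
}
```

For a pattern this simple, both variants should return the same matches. The negative lookahead is more useful when the delimited part sits inside a larger pattern and the content must never contain the delimiter.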