Search code examples
.netregexexcelopenxmlregex-group

Regex group string where the delimiter can be used twice (.net regex)


I am writing a parser for excel that can update values in the document. I am currently parsing the header/footer part of the spreadsheet document format. The format of a header/footer in excel is stored as plain text, delimited by:

  • &L
  • &C
  • &R

So your header / footer might look like this in the xml:

&LTodaysDate&CDocumentTitle&RAuthors Name

If you have only a left and right header your xml string would look like:

&LTodaysDate&RAuthors Name

I have tried to create a pattern that can detect each of thee groups and parse out the component (ie &L,&C,&R) along with any text that occurs after that tag.


The regex string is this: (&.{1})([A-Za-z\d_ ]*) (Link to example)

However I have a fringe case problem that will mean I can't properly parse excel header that contain ampersands.

In an excel header for your document to have an ampersand in the title (this is in plain text) you must type &&. So the xml of a header with an ampersands might look like:

&RPork && Beans (which would display "Pork & Beans" in the spreadsheet).

My regex is not able to cope with the premature ampersand. In the first group ((&.{1})) I ask for anything that has an ampersand and the character that follows it (ie an L/C/R). How can I tell this group to not include when there are 2 ampersands. My regex skills are pretty novice, I can describe what I want at a higher level:

I want to split the string wherever I see &L/&C/&R and capture all text after this, up to another &L/&C/&R delimiter (excluding new line spaces etc). I can best describe this in C# linq below.

(&.{1}.Where(c => c != '&'))([A-Za-z\d_ ]*)

For the string "&RPork && Beans"

my regex captures 2 matches each with 2 groups:

match 1
group 1: "&R" group 2: "Pork "

match 2
group 1: "&&" group 2: " Beans"

and I'd want it to match once:
group 1: "&R" group 2: "Pork && Beans"

Thanks for the help


Solution

  • You may use

    var result = Regex.Split(s, "(&[LRC])").Where(x => !string.IsNullOrWhiteSpace(x));
    

    See the regex demo. The (&[LRC]) will match & and a L, R or C letter after it, this value will be extracted into the resulting array due to the capturing parentheses.

    enter image description here

    Another usage example:

    var s = "&RPork && Beans&CDocument Title";
    var result = Regex.Split(s, "(&[LRC])")
            .Where(x => !string.IsNullOrWhiteSpace(x))
            .ToList();
    var data = result.Where((c,i) => i % 2 == 0).Zip(result.Where((c,i) => i % 2 != 0),
            (delimiter, value) => new KeyValuePair<string, string>(delimiter, value));
    foreach (var kvp in data)
        Console.WriteLine("Delimiter: {0}\nValue: {1}", kvp.Key, kvp.Value);
    

    Output:

    Delimiter: &R
    Value: Pork && Beans
    Delimiter: &C
    Value: Document Title