I am writing a parser for excel that can update values in the document. I am currently parsing the header/footer part of the spreadsheet document format. The format of a header/footer in excel is stored as plain text, delimited by:
&L
&C
&R
So your header / footer might look like this in the xml:
<odaysDate&CDocumentTitle&RAuthors Name
If you have only a left and right header your xml string would look like:
<odaysDate&RAuthors Name
I have tried to create a pattern that can detect each of thee groups and parse out the component (ie &L
,&C
,&R
) along with any text that occurs after that tag.
The regex string is this: (&.{1})([A-Za-z\d_ ]*)
(Link to example)
However I have a fringe case problem that will mean I can't properly parse excel header that contain ampersands.
In an excel header for your document to have an ampersand in the title (this is in plain text) you must type &&
. So the xml of a header with an ampersands might look like:
&RPork && Beans
(which would display "Pork & Beans" in the spreadsheet).
My regex is not able to cope with the premature ampersand. In the first group ((&.{1})
) I ask for anything that has an ampersand and the character that follows it (ie an L/C/R). How can I tell this group to not include when there are 2 ampersands. My regex skills are pretty novice, I can describe what I want at a higher level:
I want to split the string wherever I see &L/&C/&R and capture all text after this, up to another &L/&C/&R delimiter (excluding new line spaces etc). I can best describe this in C# linq below.
(&.{1}.Where(c => c != '&'))([A-Za-z\d_ ]*)
For the string "&RPork && Beans"
my regex captures 2 matches each with 2 groups:
match 1
group 1: "&R"
group 2: "Pork "
match 2
group 1: "&&"
group 2: " Beans"
and I'd want it to match once:
group 1: "&R"
group 2: "Pork && Beans"
Thanks for the help
You may use
var result = Regex.Split(s, "(&[LRC])").Where(x => !string.IsNullOrWhiteSpace(x));
See the regex demo. The (&[LRC])
will match &
and a L
, R
or C
letter after it, this value will be extracted into the resulting array due to the capturing parentheses.
var s = "&RPork && Beans&CDocument Title";
var result = Regex.Split(s, "(&[LRC])")
.Where(x => !string.IsNullOrWhiteSpace(x))
.ToList();
var data = result.Where((c,i) => i % 2 == 0).Zip(result.Where((c,i) => i % 2 != 0),
(delimiter, value) => new KeyValuePair<string, string>(delimiter, value));
foreach (var kvp in data)
Console.WriteLine("Delimiter: {0}\nValue: {1}", kvp.Key, kvp.Value);
Output:
Delimiter: &R
Value: Pork && Beans
Delimiter: &C
Value: Document Title