My RTF parser needs to process two flavors of rtf files (one file per program execution): rtf files as saved from Word and rtf files as created by a COTS report generator utility. The rtf for each is valid, but different. My parser uses regex patterns to detect, extract, and process the various rtf elements in the two types of rtf files.
I decided to implement the list of rtf regex patterns in two dictionaries, one for the rtf regex patterns needed for a Word rtf file and another for the rtf regex patterns needed for a COTS utility rtf file. At runtime, my parser detects which type of rtf file is being processed (Word rtf includes the rtf element //schemas.microsoft.com/office/word and the COTS rtf does not) and then obtains the needed regex pattern from the appropriate dictionary.
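The detection step described above could be sketched as follows. This is only an illustration: RtfFlavor and IsWordRtf are hypothetical names that do not appear in my real code, and it assumes the whole file is already in memory as a string. It uses a plain substring search, so the dots in the marker are not treated as regex wildcards and a file that contains the marker more than once still counts as Word rtf.

```csharp
using System;

public static class RtfFlavor
{
    // Marker text that appears in Word-generated rtf but not in the COTS rtf.
    private const string WordMarker = "//schemas.microsoft.com/office/word";

    // Hypothetical helper: returns true when the marker occurs anywhere
    // in the raw rtf text, compared case-insensitively.
    public static bool IsWordRtf(string rawRtf)
    {
        if (rawRtf == null) throw new ArgumentNullException(nameof(rawRtf));
        return rawRtf.IndexOf(WordMarker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```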
To ease the task of referencing the patterns as I write the code, I implemented an enum where each enum value represents a specific regex pattern. To keep the patterns in sync with their corresponding enum values, I implemented the regex patterns as a here-string (a verbatim string literal) where each line is a csv composition: {enum name}, {word rtf regex pattern}, {cots rtf regex pattern}. Then, at run time, when the patterns are loaded into their dictionaries, I obtain the int value of the enum from the csv and use it to create the dictionary key.
This makes writing the code easier, but I'm not sure it's the best way to implement and reference the rtf expressions. Is there a better way?
Example code:
public enum Rex {FOO, BAR};
string ex = @"FOO, word rtf regex pattern for FOO, cots rtf regex pattern for FOO
BAR, word rtf regex pattern for BAR, cots rtf regex pattern for BAR
";
I load the dictionaries like this:
using (StringReader reader = new StringReader(ex))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
int enumIntValue = (int)(Rex)Enum.Parse(typeof(Rex), splitLine[0].Trim());
ObjWordRtfDict.Add(enumIntValue, splitLine[1].Trim());
ObjRtfDict.Add(enumIntValue, splitLine[2].Trim());
}
}
Then, at runtime, I access ObjWordRtfDict or ObjRtfDict based on the type of rtf file the parser detects.
string regExPattFoo = ObjRegExExpr.GetRegExPattern(ClsRegExExpr.Rex.FOO);
public string GetRegExPattern(Rex patternIndex)
{
string regExPattern = "";
if (isWordRtf)
{
ObjWordRtfDict.TryGetValue((int)patternIndex, out regExPattern);
}
else
{
ObjRtfDict.TryGetValue((int)patternIndex, out regExPattern);
}
return regExPattern;
}
Modified code, based on Asif's recommendations
I kept my enum for pattern names so that references to pattern names can be checked by the compiler.
Example csv file, included as an embedded resource
SECT,^\\pard.*\{\\rtlch.*\\sect\s\}, ^\\pard.*\\sect\s\}
HORZ_LINE2, \{\\pict.*\\pngblip, TBD
Example usage
string sectPattern = ObjRegExExpr.GetRegExPattern(ClsRegExPatterns.Names.SECT);
ClsRegExPatterns class
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;
namespace foo
{
public class ClsRegExPatterns
{
readonly bool isWordRtf = false;
List<ClsPattern> objPatternList;
public enum Names { SECT, HORZ_LINE2 };
public class ClsPattern
{
public string Name { get; set; }
public string WordRtfRegex { get; set; }
public string COTSRtfRegex { get; set; }
}
public ClsRegExPatterns(StringBuilder rawRtfTextFromFile)
{
// determine if input file is Word rtf or not Word rtf:
// the marker must occur at least once; it is escaped so '.' matches literally
if (Regex.IsMatch(rawRtfTextFromFile.ToString(), Regex.Escape("//schemas.microsoft.com/office/word"), RegexOptions.IgnoreCase))
{
isWordRtf = true;
}
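//read patterns from embedded content csv file (the reader is disposed after use)
string patternsAsCsv;
using (StreamReader resourceReader = new StreamReader(Assembly.GetExecutingAssembly().GetManifestResourceStream("eLabBannerLineTool.Packages.patterns.csv")))
{
patternsAsCsv = resourceReader.ReadToEnd();
}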
//create list to hold patterns
objPatternList = new List<ClsPattern>();
//load pattern list
using (StringReader reader = new StringReader(patternsAsCsv))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
ClsPattern objPattern = new ClsPattern
{
Name = splitLine[0].Trim(),
WordRtfRegex = splitLine[1].Trim(),
COTSRtfRegex = splitLine[2].Trim()
};
objPatternList.Add(objPattern);
}
}
}
public string GetRegExPattern(Names patternIndex)
{
string regExPattern = "";
string patternName = patternIndex.ToString();
if (isWordRtf)
{
regExPattern = objPatternList.SingleOrDefault(x => x.Name == patternName)?.WordRtfRegex;
}
else
{
regExPattern = objPatternList.SingleOrDefault(x => x.Name == patternName)?.COTSRtfRegex;
}
return regExPattern;
}
}
}
If I understand your problem statement correctly, I would prefer something like the following.
Create a class called RtfProcessor:
public class RtfProcessor
{
public string Name { get; set; }
public string WordRtfRegex { get; set; }
public string COTSRtfRegex { get; set; }
void ProcessFile()
{
throw new NotImplementedException();
}
}
Here Name holds the pattern name (FOO, BAR, etc.). You can maintain a list of these objects and populate it from the csv lines like this:
List<RtfProcessor> fileProcessors = new List<RtfProcessor>();
using (StringReader reader = new StringReader(ex))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
RtfProcessor rtfProcessor = new RtfProcessor();
rtfProcessor.Name = splitLine[0].Trim();
rtfProcessor.WordRtfRegex = splitLine[1].Trim();
rtfProcessor.COTSRtfRegex = splitLine[2].Trim();
fileProcessors.Add(rtfProcessor);
}
}
And to retrieve the regex pattern for FOO or BAR:
// to get the Word regex pattern for FOO you can use (requires using System.Linq)
string fooPattern = fileProcessors.SingleOrDefault(x => x.Name == "FOO")?.WordRtfRegex;
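If you would rather keep the enum from the question so that the compiler checks the pattern names, the same csv rows could be indexed by the enum instead of searched by string. The sketch below is only one possible shape: PatternTable is a hypothetical name, the Names enum is copied from the question, and the rtf flavor is passed in as a bool rather than detected. Rows are validated while loading, so a misspelled name or a pattern that itself contains a comma (for example a {1,3} quantifier) fails immediately instead of producing a truncated pattern.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum Names { SECT, HORZ_LINE2 }

// Hypothetical class: an enum-keyed lookup built from "name, word, cots" rows.
public sealed class PatternTable
{
    // Item1 = Word rtf pattern, Item2 = COTS rtf pattern.
    private readonly Dictionary<Names, Tuple<string, string>> patterns =
        new Dictionary<Names, Tuple<string, string>>();
    private readonly bool isWordRtf;

    public PatternTable(string patternsAsCsv, bool isWordRtf)
    {
        this.isWordRtf = isWordRtf;
        using (var reader = new StringReader(patternsAsCsv))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim().Length == 0) continue; // skip blank lines
                string[] fields = line.Split(',');
                Names name;
                // Reject rows with the wrong field count or a name that is
                // not a defined member of the enum.
                if (fields.Length != 3
                    || !Enum.TryParse(fields[0].Trim(), out name)
                    || !Enum.IsDefined(typeof(Names), name))
                {
                    throw new FormatException("Bad pattern row: " + line);
                }
                // Add throws ArgumentException if a name appears twice.
                patterns.Add(name, Tuple.Create(fields[1].Trim(), fields[2].Trim()));
            }
        }
    }

    // Returns the pattern for the current rtf flavor, or null if the
    // csv had no row for this name.
    public string GetRegExPattern(Names name)
    {
        Tuple<string, string> entry;
        if (!patterns.TryGetValue(name, out entry)) return null;
        return isWordRtf ? entry.Item1 : entry.Item2;
    }
}
```

Usage would then be something like new PatternTable(csvText, isWordRtf).GetRegExPattern(Names.SECT), where csvText and isWordRtf come from wherever you already load the csv and detect the file type.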
Hope this helps.