I have written an application named address_parser.exe in C# (WinForms), targeting PCs running Windows XP, Vista, 7 and 8, with .NET Framework 3.5 as the minimum requirement.
The application reads in and parses text files (plain text files only, as I have no control over the input files so XML is not an option, unfortunately).
These text files contain a set of data, let's say an address, split over multiple non-consecutive lines.
Please have a look at the following two text files as a demo:
address_type_1.txt:
Elm Grove
47
PO5 1JF
Southsea
and
address_type_2.txt:
Southsea
Albert Road
147b
PO4 0JW
Currently, my code hard-codes where in the input file the street, the house number, the zip code and the city are located. So for each address file type I have created a set of rules stating which line contains which information.
In addition, I have a set of regular expressions that check the validity of each information (street, house number, zip code, city).
Since these two sets of rules/checks (which line contains which information/regex pattern for each information) vary for each different address type, I would like to store these rules in a sort of config file. So instead of hard coding this, I would like to have a configuration file for each address type, that my application can read and configure itself how to parse the particular address file type.
I would like to get some ideas and inspiration from you. Please share your thoughts and best practices!
Thanks!
Below are some thoughts of mine, and code snippets I am using so far...
My currently hard coded address file parsing runs like this:
// Requires: using System; using System.IO; using System.Text;
//           using System.Text.RegularExpressions;
public static Address Parse(string fileName)
{
    var a = new Address();
    a.OriginalFile = fileName;
    int i = 0; // zero-based index of the current line
    using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.None))
    {
        using (var reader = new StreamReader(fs, Encoding.GetEncoding(65001)))
        {
            Regex rgxStreet = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$");
            Regex rgxNumber = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,20}$");
            Regex rgxCity = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$");
            Regex rgxZIP = new Regex(@"^([0-9]){5}$");
            string rawLine;
            // ReadLine returns null at end of file, so check before trimming
            while ((rawLine = reader.ReadLine()) != null)
            {
                var line = rawLine.TrimEnd(';').Trim();
                if (i == 4 && rgxStreet.IsMatch(line))
                {
                    a.Street = line;
                }
                else if (i == 7 && rgxNumber.IsMatch(line))
                {
                    a.Number = line;
                }
                else if (i == 12 && (rgxZIP.IsMatch(line) || String.IsNullOrEmpty(line)))
                {
                    a.Zip = line;
                }
                else if (i == 15 && rgxCity.IsMatch(line))
                {
                    a.City = line;
                }
                i++;
            }
        }
    }
    return a;
}
As you can see, I am also using individual regular expressions on those 4 attributes to check if the stuff that I am reading is valid.
Now, I would like to modify this hard coded information (line X contains field Y with regular expression Z) so that I can support reading and parsing files where the same information is stored in a different order, or with different valid values.
The example above targets a file containing an address in Germany (ZIP code is 5 digits).
Parsing another type of text file which contains an address in the UK may look like this:
line 1: city;
line 2: zip;
line 20: street;
line 159: number;
In this example, both the order of the information and the regex needed for the zip code have changed (the UK postal codes in these files are 6 characters long and contain letters and numbers).
Instead of hard coding the information how to parse this type of file, I would like something like a config file which tells my application how to parse a specific type of file. Something like this:
#config file for UK address files:
#line;field;regex;
1;city;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$";
2;zip;@"^([A-Za-z0-9]){6}$";
20;street;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$";
159;number;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,20}$";
My question is: is this a good idea, or are there better ways to achieve this (to tell my application how a specific file needs to be read and parsed and its contents interpreted and validated)?
Thank you!
Yes, it is a good idea. Use Newtonsoft.Json to help you load the config, like this:
private class StartSettings
{
    public string CityReg;
    public int CityNum;
    public string ZipReg;
    public int ZipNum;
    public string StreetReg;
    public int StreetNum;
    public string NumberReg;
    public int NumberNum;
}
var configString = File.ReadAllText(configFilePath);
var config = JsonConvert.DeserializeObject<StartSettings>(configString);
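For illustration, a config file matching the StartSettings class above might look like the following. This is only a sketch: the line numbers are copied from the UK listing in the question, and the regular expressions are simplified placeholders rather than the original patterns. Note that the C# `@"..."` wrapper does not belong in the file, and any backslash in a pattern has to be doubled inside a JSON string. The `*Num` values must use the same numbering (zero-based or one-based) as the line counter in the parsing loop.

```json
{
  "CityReg": "^.{0,128}$",
  "CityNum": 1,
  "ZipReg": "^[A-Za-z0-9]{6}$",
  "ZipNum": 2,
  "StreetReg": "^.{0,128}$",
  "StreetNum": 20,
  "NumberReg": "^.{0,20}$",
  "NumberNum": 159
}
```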
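And to read the files, just use:
    // Requires: using System.IO; using System.Linq; using System.Text;
    var a = new Address();
    int i = 0;
    Regex rgxStreet = new Regex(config.StreetReg);
    Regex rgxNumber = new Regex(config.NumberReg);
    Regex rgxCity = new Regex(config.CityReg);
    Regex rgxZIP = new Regex(config.ZipReg);
    // File.ReadAllLines is used here because File.ReadLines needs .NET 4.0 or later
    foreach (var line in File.ReadAllLines(fileName, Encoding.GetEncoding(65001))
                             .Select(l => l.TrimEnd(';').Trim()))
    {
        if (config.CityNum == i && rgxCity.IsMatch(line))
            a.City = line;
        // ... (same kind of check for the other fields)
        i++;
    }
    return a;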
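If the set of fields differs between file types, a variation on the same idea is to store a list of (line, field, regex) rules, which is close to the layout proposed in the question. The following is a minimal sketch under that assumption; `FieldRule` and `RuleParser` are hypothetical names, and the line index is zero-based like the counter `i` above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule type: one entry per field to extract.
public class FieldRule
{
    public int Line;        // zero-based index of the line holding the field
    public string Field;    // field name, e.g. "city" or "zip"
    public string Pattern;  // regular expression the trimmed line must match
}

public static class RuleParser
{
    // Applies every rule to the given lines and returns a map of
    // field name -> value for each rule whose line exists and matches.
    public static Dictionary<string, string> Apply(IList<FieldRule> rules, IList<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (var rule in rules)
        {
            if (rule.Line < 0 || rule.Line >= lines.Count)
                continue; // the file is shorter than this rule expects

            var value = lines[rule.Line].TrimEnd(';').Trim();
            if (Regex.IsMatch(value, rule.Pattern))
                result[rule.Field] = value;
        }
        return result;
    }
}
```

Such a list could be stored as a JSON array and loaded with `JsonConvert.DeserializeObject<List<FieldRule>>(configString)`, so adding a field to a file type only means adding one entry to its config.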