c# regex quotes escaping

Text parsing and splitting text including/excluding quotation marks


My question is almost identical to an earlier entry I found here but not quite.

I need to parse a text file where the data is structured like this: each item in the file begins with a # followed by the label. The fields in each item are separated by one or more whitespace characters.

Here comes the part I'm having a problem with. Each field may or may not be enclosed in quotation marks; they are only required if the data contains spaces.

So what I'm after is a regex that splits by whitespace but not if that whitespace is inside a quotation.

At the moment I'm using a separate regex for each label, but it would be much more efficient to split the line immediately when reading from the file. For the account example below, the regex is (^#[A-z]+)\s([0-9]+)\s(.+)

Example of data

#ACCOUNT 7059 "Misc. travelexpenses"
#ADRESS "M. Jackson" "somewhere over the rainbow" WI53233-1704 555-12345

Solution

  • You can use an "OR" construct (alternation) to define the possible forms of a field. For example,

    ([A-Za-z]+|"[^"]+")
    

    matches both Kring and "Mr. Kring".

    Edit: So, to get all your fields and the label in the above records you could use

    (?:^#|\s+)([^"#\s]+|"[^"]+")
    

    http://gskinner.com/RegExr/ is a good tool for testing regular expressions.
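
As a rough illustration of how the second pattern might be applied in C#, here is a minimal sketch. The class name `RecordSplitter` and the method `SplitFields` are made up for this example and are not part of the original answer. Stripping the surrounding quotation marks is also an assumption about what the caller wants. Note that the pattern as written does not match empty quoted fields (`""`) or fields containing `#`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class RecordSplitter
{
    // The pattern from the answer: each field is preceded either by the
    // leading '#' or by whitespace, and is either an unquoted token or a
    // double-quoted string. In a verbatim string, "" stands for one quote.
    private static readonly Regex FieldPattern =
        new Regex(@"(?:^#|\s+)([^""#\s]+|""[^""]+"")");

    // Returns the label (without '#') followed by the fields.
    // Assumption: callers want the surrounding quotation marks removed.
    public static List<string> SplitFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
        {
            // Group 1 holds the field itself, without the separator.
            fields.Add(m.Groups[1].Value.Trim('"'));
        }
        return fields;
    }
}
```

Calling `RecordSplitter.SplitFields` on the `#ACCOUNT` line from the question should give the three entries `ACCOUNT`, `7059` and `Misc. travelexpenses`, so the same call can be used for every label instead of one regex per label.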