c# regex sprache

Regex for ignoring consecutive quotation marks in string


I have built a parser in Sprache and C# for files using a format I don't control. Using it I can correctly convert:

a = "my string";

into

my string

The parser (for the quoted text only) currently looks like this:

public static readonly Parser<string> QuotedText =
    from open in Parse.Char('"').Token()
    from content in Parse.CharExcept('"').Many().Text().Token()
    from close in Parse.Char('"').Token()
    select content;

However, the format I'm working with escapes quotation marks by doubling them ("double double" quotes), e.g.:

a = "a ""string"".";

When I attempt to parse this, nothing is returned. It should return:

a ""string"".

Additionally

a = "";

should be parsed into string.Empty or similar.

I've tried regexes based on other answers, without success, using patterns like "(?:[^;])*", or:

public static readonly Parser<string> QuotedText =
    from content in Parse.Regex("""(?:[^;])*""").Token()

This doesn't work (i.e. no matches are returned in the above cases). I think my beginner's regex skills are getting in the way. Does anybody have any hints?

EDIT: I was testing it here - http://regex101.com/r/eJ9aH1


Solution

  • If I'm understanding you correctly, this is the kind of regex you're looking for:

    "(?:""|[^"])*"
    

    How it works:

    1. " matches the opening quote.
    2. (?:""|[^"])* matches either an escaped quote (two consecutive quotes) or any character that is not a quote (including newlines), repeated zero or more times.
    3. " matches the closing quote.

    But it's always going to boil down to whether your input is balanced. If it is not, you'll get false positives. And if you have a string such as "string"", what should be matched: "string"", "", or nothing? That's a tough decision, one that, fortunately, you don't have to make if you are sure of your input.
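
For what it's worth, here is a minimal sketch of how that pattern might be used from C#. It relies only on System.Text.RegularExpressions rather than Sprache, and the class name QuotedTextDemo and the method Extract are made up for illustration. The main gotcha is that inside a verbatim (@"...") string every literal quote has to be doubled again, so the pattern looks noisier in source than the regex above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedTextDemo
{
    // Verbatim string: each "" inside @"..." is one literal quote,
    // so the regex engine sees:  "(?:""|[^"])*"
    private static readonly Regex Quoted = new Regex(@"""(?:""""|[^""])*""");

    // Hypothetical helper: returns the text between the outer quotes of the
    // first match, or null when nothing matches. With unescape set to true,
    // each doubled quote is collapsed into a single quote.
    public static string Extract(string input, bool unescape = false)
    {
        Match m = Quoted.Match(input);
        if (!m.Success) return null;

        // Strip the opening and closing quote from the matched text.
        string inner = m.Value.Substring(1, m.Value.Length - 2);
        return unescape ? inner.Replace("\"\"", "\"") : inner;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("a = \"my string\";"));               // my string
        Console.WriteLine(Extract("a = \"a \"\"string\"\".\";"));       // a ""string"".
        Console.WriteLine(Extract("a = \"a \"\"string\"\".\";", true)); // a "string".
        Console.WriteLine(Extract("a = \"\";").Length);                 // 0
        Console.WriteLine(Extract("\"string\"\""));                     // string
    }
}
```

The last call is the unbalanced case: the regex backtracks, settles for "string", and leaves the trailing quote unmatched, so whether that counts as a false positive depends on your format. In Sprache, the same verbatim pattern could presumably be handed to Parse.Regex, but that is untested here.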