java regex escaping quotes

Regular expression to match escaped characters (quotes)


I want to build a simple regex that matches quoted strings, including any escaped quotes within them. For instance:

"This is valid"
"This is \" also \" valid"

Obviously, something like

"([^"]*)"

does not work, because the match ends at the first escaped quote.
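
To make the failure concrete, here is a minimal Java sketch that applies this pattern to the second sample string. The class and method names (`NaiveQuoteDemo`, `firstMatch`) are illustrative and not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveQuoteDemo {
    // Returns the first match of the naive pattern "([^"]*)", or null if there is none.
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("\"([^\"]*)\"").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Input text: "This is \" also \" valid"
        String sample = "\"This is \\\" also \\\" valid\"";
        // The match stops at the first escaped quote: "This is \"
        System.out.println(firstMatch(sample));
    }
}
```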

What is the correct version?

I suppose the answer would be the same for other escaped characters (by just replacing the respective character).

By the way, I am aware of the "catch-all" regex

"(.*?)"

but I try to avoid it whenever possible, because, not surprisingly, it runs somewhat slower than a more specific one.


Solution

  • The problem with all the other answers is that they pass only the initial, obvious tests and fall short under further scrutiny. For example, all of them assume that the very first quote is not escaped. More importantly, escaping is more complex than a single backslash, because that backslash can itself be escaped. Imagine trying to match a string that ends with a backslash. How would that be possible?

    This is the pattern you are looking for. It doesn't assume that the first quote it sees is an unescaped one, and it allows backslashes themselves to be escaped.

    (?<!\\)(?:\\{2})*"(?:(?<!\\)(?:\\{2})*\\"|[^"])+(?<!\\)(?:\\{2})*"
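
As a hedged sketch of how this pattern might be used from Java: in a Java string literal every regex backslash has to be doubled and every double quote escaped, so the literal looks much noisier than the regex itself. The class and method names (`QuotedStringFinder`, `findQuoted`) and the sample inputs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringFinder {
    // The pattern above as a Java string literal: each regex backslash
    // is doubled and each double quote is written as \".
    static final Pattern QUOTED = Pattern.compile(
        "(?<!\\\\)(?:\\\\{2})*\""                    // opening quote that is not escaped
        + "(?:(?<!\\\\)(?:\\\\{2})*\\\\\"|[^\"])+"   // escaped quotes or any non-quote char
        + "(?<!\\\\)(?:\\\\{2})*\"");                // closing quote that is not escaped

    // Collects every match in the input, surrounding quotes included.
    static List<String> findQuoted(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Input text: say "This is \" also \" valid" now
        System.out.println(findQuoted("say \"This is \\\" also \\\" valid\" now"));
        // Input text: "abc\\"  (the content ends with an escaped backslash)
        System.out.println(findQuoted("\"abc\\\\\""));
        // Input text: \"skipped "real"  (the very first quote is escaped)
        System.out.println(findQuoted("\\\"skipped \"real\""));
    }
}
```

With these inputs, the first call finds the whole quoted string including both escaped quotes, the second finds the string that ends with an escaped backslash, and the third skips the escaped leading quote and finds only "real".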
    

    Explanation:

    (?<!\\) No backslash directly behind (to make sure we start counting backslashes from the first one)

    (?:\\{2})* Any number of doubled backslashes (they nullify each other)

    " Quote char

    (?: Open group

    (?<!\\) No backslash directly behind (so the backslashes before an escaped quote are counted from the first one)

    (?:\\{2})* Any number of doubled backslashes (they nullify each other)

    \\" Escaped quote char (because these are allowed inside the quotes)

    | Or

    [^"] Anything other than a quote char

    ) Close group

    + One or more repetitions of the group (so an empty string, "", is not matched)

    (?<!\\) No backslash directly behind (so the backslashes before the closing quote are counted from the first one)

    (?:\\{2})* Any number of doubled backslashes (they nullify each other)

    " Quote char