
What is the regular expression to identify string literals in Java?


I am trying to write a parser in which I need to identify string literals. If a string starts and ends with ' (i.e. a single quote), what regular expression will identify the string literal?

I'm using JavaCC to write the parser. Can anybody help me with the actual regular expression in token format? I have tried a lot on my own.

For example:

< INTEGER_VALUE : "0" | (["1"-"9"] (["0"-"9"])*) >

This is the token format for an integer literal. I want the same kind of token definition for a string literal that starts and ends with a single quote. I also tried the metacharacters described in the tutorial at http://www.vogella.com/articles/JavaRegularExpressions/article.html, but without success.


Solution

  • I'm assuming that you are using JavaCC. The answer depends on the syntax of strings in your language. Let's say any character other than an apostrophe is allowed in a string, i.e. a string consists of two apostrophes with any number (0 or more) of non-apostrophes in between.

    <STRING: "'" (~["'"])* "'">
    

    Now many languages don't allow newlines or returns in strings. So here let's ban those too:

    <STRING: "'" (~["'","\n","\r"])* "'">
    

    Now the problem is: what if someone wants to put apostrophes, newlines or returns in a string? Some languages (e.g. C) use the backslash as an escape character, so let's say

    • \' means apostrophe
    • \n means newline
    • \r means return
    • \\ means backslash
    • \x where x is any other character is considered an error

    Here is the expression:

    <STRING: "'"  ("\\" ("\\" | "n" | "r" | "'") | ~["\\","\n","\r","'"] )* "'">
    

    I.e. a string is two apostrophes with a sequence of 0 or more groups in between, where each group is either one of the two-character sequences \\, \n, \r, \', or a single character that is not a backslash, a newline, a return or an apostrophe.
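    Note that JavaCC's token notation is its own syntax, not java.util.regex, which is why metacharacters from a Java regex tutorial don't carry over. For comparison only, here is a minimal sketch of the same rule written as a java.util.regex pattern. The class and method names are hypothetical, and this is not something you would paste into a JavaCC grammar.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the escaped-string rule expressed as a java.util.regex pattern.
final class StringLiteralRegex {
    // Apostrophe, then zero or more groups, then apostrophe. Each group is either
    // a backslash followed by one of \ n r ' or a single character that is not
    // a backslash, newline, return or apostrophe.
    static final Pattern STRING =
            Pattern.compile("'(?:\\\\[\\\\nr']|[^\\\\\\n\\r'])*'");

    // True only if the whole input is exactly one string literal.
    static boolean isStringLiteral(String s) {
        return STRING.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "'abc'", "''", "'it\\'s'", "'a\\qb'", "'a'b'" };
        for (String s : samples) {
            System.out.println(s + " -> " + isStringLiteral(s));
        }
    }
}
```

    The doubled backslashes are Java string escaping on top of regex escaping; the JavaCC version needs only one level because the token definition is read directly by JavaCC.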

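    Another approach is to use lexical states. Text matched by a MORE rule is kept and becomes the prefix of the next TOKEN match, so the image of the final STRING token is the whole literal, including both apostrophes.

    <DEFAULT> MORE: { "'" : INSTRING }
    <INSTRING> MORE: { "\\\\"
                     | "\\n"
                     | "\\r"
                     | "\\'"
                     | < ~["\\","\n","\r","'"] >
                     }
    <INSTRING> TOKEN: { <STRING: "'"> : DEFAULT }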
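    To make the state switching concrete, here is a rough hand-written sketch in plain Java of what the two lexical states do. It is only an illustration under my own assumptions: the class, enum and method names are hypothetical, and the code JavaCC actually generates looks quite different.

```java
// Hypothetical illustration of the DEFAULT / INSTRING lexical states.
final class StringStateScanner {
    enum State { DEFAULT, INSTRING }

    // Returns the index just past the closing apostrophe of a string literal
    // that begins at position start, or -1 if no valid literal begins there.
    static int scanString(String in, int start) {
        State state = State.DEFAULT;
        int i = start;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (state == State.DEFAULT) {
                if (c != '\'') return -1;      // a literal must open with an apostrophe
                state = State.INSTRING;        // like: MORE "'" : INSTRING
                i++;
            } else {
                if (c == '\'') return i + 1;   // like: TOKEN "'" : DEFAULT
                if (c == '\\') {
                    // Only \\ \n \r \' are accepted escapes; anything else is an error.
                    if (i + 1 >= in.length()) return -1;
                    char e = in.charAt(i + 1);
                    if (e != '\\' && e != 'n' && e != 'r' && e != '\'') return -1;
                    i += 2;
                } else if (c == '\n' || c == '\r') {
                    return -1;                 // raw newline or return is not allowed
                } else {
                    i++;                       // ordinary character, keep accumulating
                }
            }
        }
        return -1;                             // input ended before the closing apostrophe
    }
}
```

    For instance, scanString("'ab' rest", 0) returns 4, the position right after the closing apostrophe, while an unterminated literal such as "'abc" gives -1.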