regex, scala, parsing, curly-braces, wml

Scala: Regular Expression pattern match with curly braces?


I am creating a WML-like language for my assignment, and as a first step I am supposed to write regular expressions that recognize the following:

//single = "{"
//double = "{{"
//triple = "{{{"

Here is my code for the second one:

val double = "\\{\\{\\b".r

And my test is:

println(double.findAllIn("{{ s{{ { {{{ {{ {{x").toArray.mkString(" "))

But it doesn't print anything! It's supposed to print the first, second, fifth and 6th tokens. I have tried every combination of \b and \B, and even \{{2,2} instead of \{\{, but it's still not working. Any help?

As a side question, if I wanted it to match just the first and fifth tokens, what would I need to do?


Solution

  • I tested your code (Scala 2.12.2 REPL) and, contrary to your "it doesn't print anything" statement, it actually prints the "{{" occurrence from the "{{x" substring.

    This is because x is a word character, so \b matches the position between the second { and x. Keep in mind that {, unlike x, isn't a word character, which is why \b does not match after "{{" when a space or another { follows.

    As per this tutorial

    It matches at a position that is called a "word boundary". This match is zero-length

    There are three different positions that qualify as word boundaries:

    1) Before the first character in the string, if the first character is a word character

    ...
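To make the word-boundary behaviour concrete, here is a minimal sketch that can be pasted into the REPL. It runs the question's pattern on the question's test string and records where each match starts; the names `boundaryInput` and `boundaryHits` are only illustrative.

```scala
// The question's test string: six space-separated tokens.
val boundaryInput = "{{ s{{ { {{{ {{ {{x"

// The question's pattern: two braces followed by a word boundary.
val boundaryPattern = "\\{\\{\\b".r

// Collect (start offset, matched text) for every match.
val boundaryHits =
  boundaryPattern.findAllMatchIn(boundaryInput).map(m => (m.start, m.matched)).toList

// Only the "{{" at offset 16, the one directly followed by x, is reported.
println(boundaryHits)
```

The other "{{" occurrences are followed by a space or by another brace, so there is no word character on either side of the position that \b tests.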

    As for a solution, it depends on the precise definition of a token, but lookarounds worked for me:

    "(?<!\\{)\\{{2}(?!\\{)".r
    

    It matched the "first, second, fifth and 6th token". The expression says: match "{{" that is neither preceded nor followed by "{".
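As a quick check of that claim, the sketch below applies the lookaround pattern to the question's test string and converts each match offset into a 1-based token number. The helper names are made up for illustration, and tokens are assumed to be separated by single spaces.

```scala
val sample = "{{ s{{ { {{{ {{ {{x"

// "{{" that is neither preceded nor followed by another "{".
val exactlyTwo = "(?<!\\{)\\{{2}(?!\\{)".r

// Start offset of every match.
val matchStarts = exactlyTwo.findAllMatchIn(sample).map(_.start).toList

// Token number = number of spaces before the match, plus one.
val tokenNumbers = matchStarts.map(s => sample.take(s).count(_ == ' ') + 1)

println(tokenNumbers) // List(1, 2, 5, 6)
```

The "{{{" token is skipped because each pair of braces inside it has a third brace on one side.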

    For the side question:

    "(?<![^ ])\\{\\{(?![^ ])".r //match `{{` surrounded by spaces or string boundaries
    

    Or, depending on your interpretation of "space":

    "(?<!\\S)\\{\\{(?!\\S)".r
    

    Both matched the 1st and 5th tokens. I couldn't use plain positive lookarounds for a space because I wanted the beginning and end of the input (boundaries) to be taken into account automatically. The double negation, ! combined with [^ ] (or \S), has the effect of implicitly including ^ and $. Alternatively, you could use:

    "(?<=^|\\s)\\{\\{(?=\\s|$)".r
    

    You can read about lookarounds here. Basically, they assert what comes before or after a position; put simply, they check the surrounding text but don't include it in the matched string itself.
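The three side-question patterns above can be compared side by side. This is only a sketch on the question's single-line test string; the variable names are illustrative, and inputs containing tabs or newlines may separate the [^ ] variant from the \S and \s variants.

```scala
val text = "{{ s{{ { {{{ {{ {{x"

// The three alternatives proposed for the side question.
val variants = List(
  "(?<![^ ])\\{\\{(?![^ ])",   // no non-space character on either side
  "(?<!\\S)\\{\\{(?!\\S)",     // no non-whitespace character on either side
  "(?<=^|\\s)\\{\\{(?=\\s|$)"  // whitespace or an input boundary on both sides
)

// Start offsets of all matches, one list per variant.
val startsPerVariant = variants.map(p => p.r.findAllMatchIn(text).map(_.start).toList)

startsPerVariant.foreach(println) // each line: List(0, 13)
```

Offsets 0 and 13 are the first and fifth tokens, so on this input all three variants agree.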

    Some examples of lookarounds:

    • (?<=z)aaa matches "aaa" that is preceded by z
    • (?<!z)aaa matches "aaa" that is not preceded by z
    • aaa(?=z) matches "aaa" followed by z
    • aaa(?!z) matches "aaa" not followed by z
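The four examples can be tried on a small made-up string. In this sketch, `haystack` and `startsOf` are hypothetical names chosen for the demonstration; the comments give the start offsets of the matches.

```scala
// "aaa" occurs at offsets 1 (after z), 5 (before z) and 10 (no z nearby).
val haystack = "zaaa aaaz aaa"

// Start offsets of all matches of a pattern in haystack.
def startsOf(pattern: String): List[Int] =
  pattern.r.findAllMatchIn(haystack).map(_.start).toList

println(startsOf("(?<=z)aaa")) // List(1): preceded by z
println(startsOf("(?<!z)aaa")) // List(5, 10): not preceded by z
println(startsOf("aaa(?=z)"))  // List(5): followed by z
println(startsOf("aaa(?!z)"))  // List(1, 10): not followed by z
```

In every case the matched text is just "aaa"; the z is checked but never becomes part of the match.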

    P.S. Just to make your life easier, Scala has triple-quoted (""") string literals in which backslashes don't need escaping, so instead of:

    "(?<!\\S)\\{\\{(?!\\S)".r
    

    you can just write:

    """(?<!\S)\{\{(?!\S)""".r