Scala regex match lines with special characters


I have a code segment that reads lines from a file, and I want to filter some of those lines out. Specifically, I want to discard every line that does not consist of exactly three tab-separated columns, where the first column is a number and the other two columns may contain any character except tab and newline (DOS and Unix).

I already checked my regex on http://www.regexr.com/ and it works there.

scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"

scala> mystr.matches(myreg)
res2: Boolean = false

I found that the problem is related to special characters. Here is a simpler example:

scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false

scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true

It seems I mustn't use a raw string for my line (see mystr in the first code block). But if I don't use a raw string, Scala complains:

error: invalid escape character

So how can I deal with this messy input and still use my regex to filter out some lines?


Solution

  • You are using raw string literals. Inside a raw string literal, \ does not introduce an escape sequence such as tab (\t) or newline (\n): the \n in a raw string literal is just two characters, a backslash followed by n.

    In a regex, matching a literal \ takes 2 backslashes when the pattern is written as a raw string literal, and 4 backslashes when it is written as a regular string literal.
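
The difference between the two kinds of literal, and the backslash counts mentioned above, can be checked with a small sketch like the one below. The object name BackslashDemo and the sample strings are made up for illustration and are not part of the original question.

```scala
object BackslashDemo {
  // Regular literal: the compiler turns \t into a single tab character.
  val cooked: String = "a\tb"
  // Raw (triple-quoted) literal: the backslash and the 't' stay two separate characters.
  val raw: String = """a\tb"""

  // The same regex (literal backslash followed by 't'), written both ways.
  val rawPattern: String = """a\\tb"""  // 2 backslashes in a raw literal
  val cookedPattern: String = "a\\\\tb" // 4 backslashes in a regular literal

  def main(args: Array[String]): Unit = {
    println(cooked.length)               // 3
    println(raw.length)                  // 4
    println(rawPattern == cookedPattern) // true
    println(raw.matches(rawPattern))     // true
    println(cooked.matches(rawPattern))  // false: the input holds a tab, not backslash + t
    println(cooked.matches("""a\tb"""))  // true: the regex engine reads \t as a tab
  }
}
```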

    So, to match all your inputs, you need to use the following regexps:

    val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
    val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
    println(mystr.matches(myreg)) // => true
    val tabstr = """123456\t123456"""
    println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
    val tabstr2 = "123456\t123456"
    println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true
    

    Whether the groups are capturing or non-capturing does not matter here, since you only need to check whether a string matches; capturing groups would work just as well. For the same reason you do not even need ^ and $, because matches requires the whole input string to match. If you later need to extract matches or groups, non-capturing groups give you a "cleaner" output structure, and that is all.
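
Both points can be illustrated with a short sketch. It assumes input lines that contain real tab characters (as in the tabstr example with a regular string); the names GroupsDemo, Columns and parse are hypothetical and only serve the illustration.

```scala
import scala.util.matching.Regex

object GroupsDemo {
  // A line with real tab characters between the columns.
  val line: String = "123456\tfoo\tbar"

  // String.matches must consume the whole input, so ^ and $ add nothing.
  val anchored: String = "^[0-9]+\t[^\t\r\n]+\t[^\t\r\n]+$"
  val unanchored: String = "[0-9]+\t[^\t\r\n]+\t[^\t\r\n]+"

  // Capturing groups only become relevant once the columns are extracted.
  val Columns: Regex = "([0-9]+)\t([^\t\r\n]+)\t([^\t\r\n]+)".r

  // Returns the three columns, or None when the line does not match as a whole.
  def parse(s: String): Option[(String, String, String)] = s match {
    case Columns(id, second, third) => Some((id, second, third))
    case _                          => None
  }

  def main(args: Array[String]): Unit = {
    println(line.matches(anchored))         // true
    println(line.matches(unanchored))       // true
    println(("x" + line).matches(unanchored)) // false: a partial match is not enough
    println(parse(line))                    // Some((123456,foo,bar))
    println(parse("123456\tfoo"))           // None
  }
}
```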

    The last two regexps are simple: \\t matches a literal backslash followed by t, and (?:\\t|\t) matches either that two-character sequence or a tab character. \t on its own just matches a tab.

    The first one uses a tempered greedy token. This is the simplified form; a usually more efficient variant can be written with the unroll-the-loop technique: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n).

    • [0-9]+ - 1 or more digits
    • (?:\\t(?:(?!\\[trn]).)*){2} - tempered greedy token: 2 occurrences of the literal two-character string \t, each followed by zero or more characters (any character but a line break) that do not start one of the two-character combinations \t, \r or \n.
    • (?:\\r)? - 1 or 0 occurrences of a literal combination of \ and r
    • (?:\\n) - one occurrence of a literal combination of \ and n.
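
As a rough check of the breakdown above, the sketch below applies both the tempered-greedy-token regex and the unrolled variant to a few invented sample lines and keeps only those that match. The object name FilterDemo, the helper keep and the sample lines are assumptions made for this illustration. The lines are raw string literals, so \t, \r and \n are two-character sequences, as in mystr.

```scala
object FilterDemo {
  // The two patterns from the answer, copied unchanged.
  val tempered: String = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
  val unrolled: String = """[0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)"""

  // Invented sample lines; backslash sequences are literal characters here.
  val samples: Seq[String] = Seq(
    """123456\thttp://some.url/path\t\x03U\x1D\n""", // three columns, Unix ending
    """123456\tonly-two-columns\n""",                // too few columns
    """abc\tfoo\tbar\n""",                           // first column is not a number
    """1\ta\tb\tc\n""",                              // too many columns
    """42\tfoo\tbar\r\n"""                           // three columns, DOS ending
  )

  // Keeps only the lines that match the given pattern as a whole.
  def keep(pattern: String): Seq[String] = samples.filter(_.matches(pattern))

  def main(args: Array[String]): Unit = {
    keep(tempered).foreach(println)           // prints the first and the last sample
    println(keep(tempered) == keep(unrolled)) // true
  }
}
```

On these samples both patterns keep the same two lines. That is not a proof that they agree in general: for instance, [^\\] also matches line break characters, while . does not by default.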