
How do I replace every nth occurrence of a special character, such as a pipe delimiter, with another character in Scala?


I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.

So far, I have managed to insert a newline after each pipe delimiter, but I have not been able to replace the delimiter itself.

My input string is

val txt = "January|February|March|April|May|June|July|August|September|October|November|December"

println(txt.replaceAll(".\\|", "$0\n"))

The above statement generates the following output.

January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December

I tried the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I put a number in the curly braces, the newline only ends up 2 characters after the delimiter.

I'm expecting my output to be as given below.

January|February
March|April
May|June
July|August
September|October
November|December

How do I change my regular expression to get the desired output?

Update: My friend suggested I try the following statement

println(txt.replaceAll("(.*?\\|){2}", "$0\n"))

and this produced the following output

January|February|
March|April|
May|June|
July|August|
September|October|
November|December

Now I just need to get rid of the pipe symbol at the end of each line.


Solution

  • You want to move the 2nd bar | outside of the capture group, so that the replacement can keep the group and drop the bar.

    txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
    //val res0: String =
    //  January|February
    //  March|April
    //  May|June
    //  July|August
    //  September|October
    //  November|December
    

    Regex explained (the regex syntax itself is not specific to Scala)

    • ( - start a capture group
    • [^|] - any character as long as it's not the bar | character
    • [^|]+ - 1 or more of those (any) non-bar chars
    • \\| - followed by a single bar char | (the regex is \|; the backslash is doubled inside a Scala string literal)
    • [^|]+ - followed by 1 or more of any non-bar chars
    • ) - close the capture group
    • \\| - followed by a single bar char (not in capture group)
    • "$1\n" - replace each match with just the first capture group, $1 ($0 would be the entire match), followed by the newline char
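To make the difference between $0 and $1 concrete, here is a minimal sketch that applies the same pattern with both replacement strings. The object and method names (`GroupRefDemo`, `withWholeMatch`, `withGroupOne`) are made up for illustration and are not part of the answer.

```scala
object GroupRefDemo {
  // Same pattern as in the answer: the pair of fields is captured,
  // the second bar is matched but left outside the group.
  val pattern = "([^|]+\\|[^|]+)\\|"

  // $0 is the whole match, so the second bar survives the replacement.
  def withWholeMatch(text: String): String = text.replaceAll(pattern, "$0\n")

  // $1 is only the captured pair, so the second bar is dropped.
  def withGroupOne(text: String): String = text.replaceAll(pattern, "$1\n")

  def main(args: Array[String]): Unit = {
    val sample = "January|February|March|April"
    println(withWholeMatch(sample))
    println(withGroupOne(sample))
  }
}
```

With the sample above, `withWholeMatch` leaves a trailing bar on the first line, while `withGroupOne` does not. The last pair is untouched in both cases because no bar follows it.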

    UPDATE

    For the general case of N repetitions, the regex becomes a bit more cumbersome, at least if you're trying to do it with a single pattern.
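For reference, one way such a single pattern could look is sketched below. It extends the two-field regex above with a non-capturing group `(?: ... )` repeated n - 1 times. The helper name `replaceEveryNth` is hypothetical, and the sketch assumes n >= 1 and that no field is empty.

```scala
object EveryNthBar {
  // Hypothetical helper: replaces every n-th bar with a newline in one pass.
  // Assumes n >= 1 and that every field contains at least one character.
  def replaceEveryNth(text: String, n: Int): String = {
    require(n >= 1, "n must be at least 1")
    // Group 1 captures (n - 1) "field|" units plus one more field;
    // the n-th bar is matched outside the group and therefore dropped.
    val pattern = s"((?:[^|]+\\|){${n - 1}}[^|]+)\\|"
    text.replaceAll(pattern, "$1\n")
  }

  def main(args: Array[String]): Unit = {
    val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
    println(replaceEveryNth(txt, 5))
  }
}
```

For n = 2 this builds the same pattern as the two-field regex, except that the inner group is non-capturing.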

    The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice: first insert a newline after every Nth bar, then remove the bar in front of each newline.

    val n = 5
    // \\w+ assumes each field consists only of word characters (letters, digits, underscore)
    txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
       .replaceAll("\\|\n", "\n")
    //val res0: String =
    //  January|February|March|April|May
    //  June|July|August|September|October
    //  November|December
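If a regex is not a hard requirement, the same regrouping can also be sketched with plain collection methods. This is a different technique from the one in the answer: it splits on the bar and re-joins the fields n at a time. The helper name `joinEveryNth` is made up for illustration, and n is assumed to be positive.

```scala
object GroupedFields {
  // Hypothetical helper, not from the answer: no regex involved.
  // Assumes n > 0 (grouped throws for non-positive sizes).
  def joinEveryNth(text: String, n: Int): String =
    text.split('|')          // split(Char) treats '|' literally, no escaping needed
      .grouped(n)            // chunks of at most n fields each
      .map(_.mkString("|"))  // restore the bars inside each line
      .mkString("\n")        // newline between lines

  def main(args: Array[String]): Unit = {
    val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
    println(joinEveryNth(txt, 5))
  }
}
```

Unlike the `[^|]+` and `\\w+` patterns, this version does not require the fields to be non-empty or to consist of word characters, although `split` drops trailing empty fields.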