I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter. I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update: My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar |
outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
(
- start a capture group[^|]
- any character as long as it's not the bar |
character[^|]+
- 1 or more of those (any) non-bar chars\\|
- followed by a single bar char |
[^|]+
- followed by 1 or more of any non-bar chars)
- close the capture group\\|
- followed by a single bar char (not in capture group)"$1\n"
- replace the entire matching string with just the first $1
capture group ($0
is the entire matching string) followed by the newline charUPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String
twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December