In Elixir, I would like to split a string, treating all non-word characters as separators, including the Ogham Space Mark ( , U+1680), which should not be confused with a minus (-) sign.
So, if I split the string:
"1\x002\x013\n4\r5\u16806\t7 + asda - 3434"
The result should be:
["1","2","3","4","5","6","7","+","asda","-","3434"]
I'm trying to figure out how to do this with Regex, but the best I've been able to accomplish so far is:
Regex.split(~r/[\W| ]+/, input_string)
... but this drops the + and - signs, as these are not considered word characters.
Or:
Regex.split(~r/[^[:punct:]|^[:alnum:]| ]+/, input_string)
but this fails to split on the Ogham Space Mark.
This works correctly, but the extra transformation is inelegant:
Regex.split(~r/[^[:punct:]|^[:alnum:]]+/, String.replace(input_string, "\u1680", " "))
Is there any way to split this with a single Regex invocation?
Elixir regular expressions are handled by the PCRE regex engine, and your input string contains characters from the whole Unicode character table, not just the ASCII part.
You may enable Unicode mode with the help of two PCRE verbs, (*UTF)(*UCP):
Regex.split(~r/(*UTF)(*UCP)[^\w\/*+-]+/, "1\x002\x013\n4\r5\u16806\t7 + asda - 3434")
It will output:
["1", "2", "3", "4", "5", "6", "7", "+", "asda", "-", "3434"]
See the Elixir demo online.
NOTE: ~r/[^\w\/*+-]+/u and ~r/(*UTF)(*UCP)[^\w\/*+-]+/ are equivalent; the u modifier is a shorthand for the two PCRE verbs.
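As a quick check of the note above, here is a minimal sketch that runs both spellings on the sample string and compares the results. The variable names are only illustrative, and the "\u1680" escape stands in for the literal Ogham Space Mark.

```elixir
# Sample input; "\u1680" is the Ogham Space Mark between 5 and 6.
input = "1\x002\x013\n4\r5\u16806\t7 + asda - 3434"

# Same pattern, once with the u modifier and once with the inline PCRE verbs.
with_flag = Regex.split(~r/[^\w\/*+-]+/u, input)
with_verbs = Regex.split(~r/(*UTF)(*UCP)[^\w\/*+-]+/, input)

IO.inspect(with_flag)
IO.inspect(with_flag == with_verbs)
```

The u modifier is usually the more readable choice in Elixir code; the inline verbs are handy when the pattern is passed around as a plain string.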
The regex matches:
(*UTF)(*UCP) - (*UTF) treats the input string as a Unicode code point sequence, and (*UCP) makes \w Unicode-aware (so that it matches [\p{L}\p{N}_] characters)
[^\w\/*+-]+ - 1 or more characters other than letters, digits, underscores, /, *, + and -.
Note that - in the meaning of a literal - char does not have to be escaped when placed at the end of the character class.
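The individual pieces of the pattern can be checked in isolation. The following is a small sketch, not part of the original solution: the variable names are made up for illustration, and "\u00E9" (an accented e) is just an arbitrary non-ASCII letter chosen as a test character.

```elixir
# With the u modifier, \w is Unicode-aware and matches non-ASCII letters.
accented_is_word = Regex.match?(~r/^\w+$/u, "\u00E9")

# The Ogham Space Mark (U+1680) is a space separator, not a word character,
# so the negated class [^\w...] treats it as a delimiter.
ogham_is_word = Regex.match?(~r/\w/u, "\u1680")

# A hyphen at the end of a character class is a literal "-", not a range:
# "-" matches, while "," (which sits between "+" and "-" in ASCII) does not.
hyphen_is_literal = Regex.match?(~r/^[*+-]$/, "-")
comma_in_class = Regex.match?(~r/^[*+-]$/, ",")

IO.inspect({accented_is_word, ogham_is_word, hyphen_is_literal, comma_in_class})
# prints {true, false, true, false}
```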