My question is the Scala (Java) variant of this query on Python.
In particular, I have a string val myStr = "Shall we meet at, let's say, 8:45 AM?". I would like to tokenize it and retain the delimiters (all except whitespace). If my delimiters were only characters, e.g. ".", ":", "?" etc., I could do:
val strArr = myStr.split("((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?]))")
which yields
[Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]
However, I wish to make the time signature \\d+:\\d+ a delimiter, and would still like to retain it. So, what I'd like is
[Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
Note: adding (?=(\\d+:\\d+)) to the expression of the split statement does not help, because : is a delimiter in itself.
How could I make this happen?
I suggest matching all your tokens rather than splitting the string, because that gives you better control over what you get:
\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+
See the regex demo.
The alternatives are ordered from the most specific pattern to the most generic one, because the regex engine tries them from left to right at each position.
Details
\b\d{1,2}:\d{2}\b - 1 to 2 digits, a colon, then 2 digits, enclosed with word boundaries
| - or
[,.;:?]+ - 1 or more , . ; : ? chars
| - or
(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+ - 1 or more chars that are neither a delimiter char nor whitespace ([^\s,.;:?]), each of which is not the starting point of a time string.
Consider this snippet:
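val str = "Shall we meet at, let's say, 8:45 AM?"
// Alternatives: time string | run of delimiter chars | run of any other non-whitespace chars
val rx = """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r
// Print every match (token) on its own line
rx.findAllIn(str).foreach(println)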
Output:
Shall
we
meet
at
,
let's
say
,
8:45
AM
?
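To see what the negative lookahead in the last alternative contributes, here is a minimal sketch that compares the pattern above with a simplified variant that drops the lookahead. The object name TokenizerSketch, the helper tokenize, the noLookahead pattern, and the input "Meet (8:45)?" are illustrative assumptions, not part of the original answer.

```scala
import scala.util.matching.Regex

object TokenizerSketch {
  // Pattern from the answer: time string | delimiter run | run of other chars.
  // The lookahead stops the last branch right before a time string.
  val full: Regex =
    """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r

  // Hypothetical simplified variant without the lookahead, for comparison only.
  val noLookahead: Regex =
    """\b\d{1,2}:\d{2}\b|[,.;:?]+|[^\s,.;:?]+""".r

  // Collect all matches, in order of appearance, as an immutable list.
  def tokenize(rx: Regex, s: String): List[String] = rx.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    val s = "Meet (8:45)?"
    // With the lookahead the time survives: "Meet", "(", "8:45", ")", "?"
    println(tokenize(full, s))
    // Without it, "(" swallows the hour: "Meet", "(8", ":", "45)", "?"
    println(tokenize(noLookahead, s))
  }
}
```

When the time string starts a token, as in the original sentence, the order of the alternatives alone is enough and both patterns give the same result. The lookahead only matters when a non-delimiter, non-whitespace character such as "(" is glued to the front of the time.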