java, regex, string, scala, tokenize

Java string tokenization: Split on pattern and retain pattern


My question is the Scala (Java) variant of this query on Python.

In particular, I have a string val myStr = "Shall we meet at, let's say, 8:45 AM?". I would like to tokenize it and retain the delimiters (all except whitespace). If my delimiters were only single characters, e.g. "." ":" "?" etc., I could do:

val strArr = myStr.split("((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?]))")

which yields

[Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]

However, I wish to make the time signature \\d+:\\d+ a delimiter, and would still like to retain it. So, what I'd like is

[Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]

Note:

  1. Adding the disjunct (?=(\\d+:\\d+)) to the split expression does not help.
  2. Outside of the time signature, : is a delimiter in its own right.

How could I make this happen?
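
For context, here is a minimal sketch of why note 1 has no effect. split only chooses cut points, so an extra lookahead can add cuts but cannot remove the cut that (?=[,.;:?]) already makes in front of the colon. The names SplitAttempt and withTimeLookahead are illustrative only and are not part of the original question.

```scala
object SplitAttempt {
  val myStr = "Shall we meet at, let's say, 8:45 AM?"

  // The original split pattern plus the extra lookahead from note 1.
  val withTimeLookahead =
    "((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?])|(?=(\\d+:\\d+)))"

  // String.split(String) is java.lang.String.split, so Java regex rules apply.
  def tokens: List[String] = myStr.split(withTimeLookahead).toList
}
```

SplitAttempt.tokens still contains ":" as a separate token and never contains "8:45", because the position just before the colon remains a valid cut point.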


Solution

  • I suggest matching all the tokens rather than splitting the string, because matching gives you direct control over what each token looks like:

     \b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+
    


    The alternatives are ordered from the most specific pattern to the most generic one, because the regex engine tries them from left to right at each position.

    Details

    • \b\d{1,2}:\d{2}\b - 1 or 2 digits, a colon, then 2 digits, all enclosed in word boundaries
    • | - or
    • [,.;:?]+ - 1 or more ,, ., ;, :, ? chars
    • | - or
    • (?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+ - 1 or more chars, each of which is neither whitespace nor one of our delimiter chars ([^\s,.;:?]) and is not the starting point of a time string (the negative lookahead is checked before every char).
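
To illustrate what the negative lookahead in the last alternative contributes, here is a small sketch that builds the pattern from its three parts, once with the lookahead and once without it. The object name LookaheadDemo, its field names, and the test input "(8:45)" are assumptions made for illustration; only the guarded pattern comes from the answer.

```scala
import scala.util.matching.Regex

object LookaheadDemo {
  private val time   = """\b\d{1,2}:\d{2}\b""" // time string such as 8:45
  private val delims = """[,.;:?]+"""           // run of delimiter chars
  private val plain  = """[^\s,.;:?]"""         // one non-space, non-delimiter char

  // Same pattern as in the answer: each generic char is guarded by the lookahead.
  val guarded: Regex =
    (time + "|" + delims + "|(?:(?!" + time + ")" + plain + ")+").r

  // Variant without the lookahead, for comparison only.
  val unguarded: Regex =
    (time + "|" + delims + "|" + plain + "+").r

  def tokenize(rx: Regex, s: String): List[String] = rx.findAllIn(s).toList
}
```

With the input "(8:45)", the guarded pattern yields the tokens (, 8:45 and ), while the unguarded variant yields (8, : and 45). Without the lookahead, the generic alternative starts at the parenthesis and swallows the first digit before the time alternative gets a chance to match.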

    Consider this snippet:

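    val str = "Shall we meet at, let's say, 8:45 AM?"
    // Alternatives in order: time string, run of delimiters, any other non-space run
    val rx = """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r
    // Print every match, one token per line
    rx.findAllIn(str).foreach(println)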
    

    Output:

    Shall
    we
    meet
    at
    ,
    let's
    say
    ,
    8:45
    AM
    ?
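
If the tokens are needed as a collection rather than printed, the same pattern could be wrapped in a small helper along these lines. This is only a sketch: the object name Tokenizer and the method tokenize are hypothetical, and the second input in the usage note is an invented example for checking that a colon outside a time string is still its own token.

```scala
import scala.util.matching.Regex

object Tokenizer {
  // Same pattern as in the snippet above.
  private val tokenRx: Regex =
    """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r

  // Collect all matches in order of appearance; whitespace is never matched.
  def tokenize(s: String): List[String] = tokenRx.findAllIn(s).toList
}
```

Tokenizer.tokenize("Shall we meet at, let's say, 8:45 AM?") returns the same eleven tokens as the output above, and Tokenizer.tokenize("Note: call at 10:30.") returns List(Note, :, call, at, 10:30, .).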