java regex scala

Find words starting with hashtag in Scala/Java


I'm trying to write a regex that will split out all words starting with a hashtag.
For example, for the following text it should produce:

val regex = "???".r

val text = "#shouldMatch1 #shouldMatch2 notMatch nope#shouldMatch3 nooope()#shouldMatch4"

regex.split(text).toList shouldBe List("#shouldMatch1", "#shouldMatch2", "#shouldMatch3", "#shouldMatch4")

The closest I could get is val regex: Regex = "[^#\\w+]".r, but it splits a little bit more:

List("#shouldMatch1", "#shouldMatch2", "notMatch", "nope#shouldMatch3", "nooope", "#shouldMatch4")

So in some cases it returns words that do not start with a hashtag. Do you have any idea or guidance on how I should write a proper expression?

The code is written in Scala, but it should be similar in Java.


Solution

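  • split treats the regex as a delimiter, so it is a poor fit for this task. To extract the hashtags themselves, you need to use findAllIn with a regex like #\w+: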

    val regex = """#\w+""".r
    val text = "#shouldMatch1 #shouldMatch2 notMatch nope#shouldMatch3 nooope()#shouldMatch4"
    println(regex.findAllIn(text).toList)
    

    See the Scala demo.
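Since the question also asks about Java, here is a minimal sketch of the same idea with java.util.regex. The class name HashtagFinder and the method findHashtags are hypothetical names chosen for illustration; only Pattern and Matcher come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class HashtagFinder {
    // Same pattern as the Scala snippet: '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Collects every match of the pattern, in order of appearance.
    static List<String> findHashtags(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "#shouldMatch1 #shouldMatch2 notMatch nope#shouldMatch3 nooope()#shouldMatch4";
        // prints [#shouldMatch1, #shouldMatch2, #shouldMatch3, #shouldMatch4]
        System.out.println(findHashtags(text));
    }
}
```

The loop over matcher.find() plays the role of Scala's findAllIn: it walks through the matches instead of splitting on a delimiter.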

    The right pattern depends on what you count as a hashtag. Here are some common variations:

    • #\w+ - if the hashtags can contain only word chars
    • #[\w-]+ - if the hashtags can contain only word and hyphen chars
    • #\S+ - if the hashtags can contain one or more of any non-whitespace chars after #
    • #\S+\b - like #\S+, but the match stops before any trailing run of non-word chars (like a comma, etc.)
    • (?<!\S)#\S+ - like #\S+, but the # must be preceded by whitespace or the start of the string.
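
To make the differences between these variations concrete, the following sketch runs each pattern against one sample string. The class name HashtagVariants, the helper findAll, and the sample text are all assumptions made up for this illustration. The sample includes a hyphen, trailing punctuation, and a # glued to a preceding word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class HashtagVariants {
    // Returns every match of the given regex in the text, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration only.
        String text = "Check #scala-lang, nope#java and #regex!";
        String[] patterns = {
            "#\\w+",         // matches "#scala", "#java", "#regex"
            "#[\\w-]+",      // matches "#scala-lang", "#java", "#regex"
            "#\\S+",         // matches "#scala-lang,", "#java", "#regex!"
            "#\\S+\\b",      // matches "#scala-lang", "#java", "#regex"
            "(?<!\\S)#\\S+"  // matches "#scala-lang,", "#regex!"
        };
        for (String pattern : patterns) {
            System.out.println(pattern + " -> " + findAll(pattern, text));
        }
    }
}
```

Note that only the lookbehind variant skips nope#java, because the # there is preceded by a non-whitespace character.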