I can't seem to get a regex that matches either a hashtag (#), an @, or a word boundary. The goal is to break a string into Twitter-like entities and topics, so that:
input = "Hello @world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "@world", "#ruby", "anotherString"]
Getting just the words, excluding "anotherString", which is too long, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately, this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b@#]\w{3,12}\b/
but that returns ["@world", "#ruby"]. This made me realize that a word boundary is not a character by definition, so it doesn't fall into the category of "a single character" and won't match as one inside a character class (inside brackets, \b is read as a backspace instead). A few more attempts:
/\b|[@#]\w{3,12}\b/
returns ["", "", "@world", "", "#ruby", "", "", ""].
/(\b|[@#])\w{3,12}\b/
matches the right things, but returns [[""], ["@"], ["#"]], as expected, because the parentheses also capture what they enclose, and scan returns the captures rather than the whole match.
/((\b|[@#])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["@world", "@"], ["#ruby", "#"]]. So now all the correct items are there; they're just located at the first element of each subarray. The following snippet technically works:
input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this so that the regular expression alone matches and returns the correct substrings, without the collect post-processing?
You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match an @ or #, followed by a word boundary (in #ruby, that boundary is between # and ruby; in a normal word it matches at the start of the word), and then a run of word characters.
p "Hello @world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "@world", "#ruby", "anotherString"]
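As a rough sketch of why this works where the character-class attempt in the question did not: outside brackets, \b is a zero-width assertion that matches a position, while inside brackets Ruby reads it as a literal backspace character. The variable names below are made up for illustration.

```ruby
# Outside brackets, \b matches a position, not a character.
# String#=~ returns the index of the first match.
boundary_in_hashtag = ("#ruby" =~ /\b/)  # boundary sits between "#" and "r"
boundary_in_word    = ("ruby" =~ /\b/)   # boundary sits at the start of the word

# Inside brackets, \b is a backspace (0x08), so [\b@#] never acts as a boundary.
matches_backspace = /[\b]/.match?("\b")     # the string "\b" is a backspace
matches_plain     = /[\b]/.match?("Hello")  # no backspace in "Hello"

p boundary_in_hashtag  # => 1
p boundary_in_word     # => 0
p matches_backspace    # => true
p matches_plain        # => false
```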
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. In a comment on a since-deleted answer, you gave an example of matching only #ruby by using {3,4}:
p "Hello @world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
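Applied to the 3 to 12 character limit from the question, the same idea might look like the sketch below. It also shows, as an alternative, the question's own alternation rewritten with a non-capturing group (?:...), which keeps scan from returning arrays of captures. The variable names bounded and non_capturing are only illustrative.

```ruby
input = "Hello @world, #ruby anotherString"

# Optional @ or # prefix, then a word of 3 to 12 characters.
bounded = input.scan(/[@#]?\b\w{3,12}\b/)

# The alternation from the question with a non-capturing group, so scan
# returns whole matches and no collect(&:first) is needed.
non_capturing = input.scan(/(?:\b|[@#])\w{3,12}\b/)

p bounded        # => ["Hello", "@world", "#ruby"]
p non_capturing  # => ["Hello", "@world", "#ruby"]
```

Both forms skip "anotherString", which has 13 word characters and so exceeds the upper bound.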