Word segmentation for hashtags using R

I would like to do word segmentation on hashtags, that is, split a hashtag into the words it is made of. This is my attempt, but it didn't work.

What I am trying to do:

INPUT: #sometrendingtopic
OUTPUT: some trending topic

My attempt:

# assuming these tokenizers come from the tokenizers package
library(tokenizers)

s <- "#sometrendingtopic"
tokenize_character_shingles(s)
tokenize_words(s)
tokenize_characters(s)

I found some information (https://stackoverflow.com/.../r-split-string-by-symbol), but it is for Python. Thanks in advance for any ideas and guidance.

Solution

This is a decidedly non-trivial task, and I don't think it can be solved in general. Since there is no delimiter between the words, you basically need to extract substrings and check them against a dictionary for the language in question. A very crude method, which works from left to right and at each position takes the longest match it can find, uses hunspell. That package is designed for spell checking, but it can be "misused" to maybe solve this task:

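# Requires the hunspell package (called via hunspell::).
split_words <- function(cat.string){
  split <- NULL
  start.char <- 1
  # "<=" so that a trailing one-letter word is not silently dropped
  while(start.char <= nchar(cat.string))
  {
    result <- NULL
    # Try every substring starting at start.char; the last one that
    # passes the spell check (i.e. the longest) is kept.
    for(cur.char in start.char:nchar(cat.string))
    {
      test.string <- substr(cat.string,start.char,cur.char)
      # hunspell() returns the misspelled words; none means a valid word
      test <- hunspell::hunspell(test.string)[[1]]
      if(length(test) == 0) result <- test.string
    }
    # No valid word starts here: give up on this input
    if(is.null(result)) return("")
    split <- c(split,result)
    start.char <- start.char + nchar(result)
  }
  split
}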


input <- c("#sometrendingtopic","#anothertrendingtopic","#someveryboringtopic")

# Remove the leading hash sign from the input
input <- sub("^#","",input)
# Apply the word split
result <- lapply(input,split_words)
result
[[1]]
[1] "some"     "trending" "topic"   

[[2]]
[1] "another"  "trending" "topic"   

[[3]]
[1] "some"   "very"   "boring" "topic"

Please keep in mind that this method is far from perfect in multiple ways:

It is relatively slow.
It matches greedily from left to right. For example, for the hashtag "#averyboringtopic" the result will be:

[[3]]
[1] "aver"   "y"      "boring" "topic"

This happens because "aver" is apparently a valid word in this particular dictionary. So: use at your own risk, and improve upon this!
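
One possible way to improve on the greedy matching is to score every complete segmentation and keep the best one, instead of committing to the longest word at each position. The sketch below is not the hunspell approach from above: it swaps in a small dynamic program over a unigram frequency table. The names segment_best and word_counts are hypothetical, and the counts are invented purely for illustration; in practice you would need a real word-frequency list for your language.

```r
# Toy unigram counts. These numbers are invented for illustration only;
# a real application would load counts from a corpus-based frequency list.
word_counts <- c(a = 5000, aver = 1, y = 2, very = 800, boring = 50,
                 topic = 120, some = 900, trending = 30)

# Split s into the sequence of known words with the highest total
# log-probability. Returns character(0) if no full segmentation exists.
segment_best <- function(s, counts, max_len = 20) {
  n <- nchar(s)
  if (n == 0) return(character(0))
  log_p <- log(counts / sum(counts))
  best <- c(0, rep(-Inf, n))  # best[i + 1]: best score for the first i characters
  back <- integer(n)          # back[i]: start index of the last word ending at i
  for (i in seq_len(n)) {
    for (j in max(1, i - max_len + 1):i) {
      w <- substr(s, j, i)
      if (w %in% names(log_p) && best[j] + log_p[[w]] > best[i + 1]) {
        best[i + 1] <- best[j] + log_p[[w]]
        back[i] <- j
      }
    }
  }
  if (!is.finite(best[n + 1])) return(character(0))
  # Walk the back-pointers from the end to recover the words
  words <- character(0)
  i <- n
  while (i > 0) {
    j <- back[i]
    words <- c(substr(s, j, i), words)
    i <- j - 1
  }
  words
}

segment_best("averyboringtopic", word_counts)
```

With these toy counts, "a" followed by "very" scores higher than "aver" followed by "y", so the problematic hashtag is split as "a", "very", "boring", "topic". The result is only as good as the frequency table, and a word missing from the table makes the whole hashtag unsegmentable.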