I would like to do word segmentation for hashtags, i.e. split a hashtag into its component words. This is my attempt, but it obviously didn't work.
My attempt:
s <- "#sometrendingtopic"
tokenize_character_shingles(s)
tokenize_words(s)
tokenize_characters(s)
I found some related information, but it is for Python: https://stackoverflow.com/.../r-split-string-by-symbol. Thanks in advance for any ideas and guidance.
So ... this is an absolutely non-trivial task, and I don't think it can be solved in full generality. Since you are missing a delimiter between your words, you basically need to extract substrings and check them against a dictionary of your desired language.
A very crude method, which only extracts the longest matches it can find from left to right, is to use hunspell, which is designed for spell checking but can be "misused" to perhaps solve this task:
split_words <- function(cat.string) {
  split <- NULL
  start.char <- 1
  # walk through the string, repeatedly taking the longest prefix
  # that hunspell recognises as a correctly spelled word
  # (<= so a final single-character word is not silently dropped)
  while (start.char <= nchar(cat.string)) {
    result <- NULL
    for (cur.char in start.char:nchar(cat.string)) {
      test.string <- substr(cat.string, start.char, cur.char)
      # hunspell() returns the misspelled words; an empty result
      # means test.string is in the dictionary
      test <- hunspell::hunspell(test.string)[[1]]
      if (length(test) == 0) result <- test.string
    }
    if (is.null(result)) return("")  # no dictionary word found
    split <- c(split, result)
    start.char <- start.char + nchar(result)
  }
  split
}
input <- c("#sometrendingtopic", "#anothertrendingtopic", "#someveryboringtopic")
# strip the leading "#" from the input
input <- sub("#", "", input)
# apply the word split
result <- lapply(input, split_words)
result
[[1]]
[1] "some" "trending" "topic"
[[2]]
[1] "another" "trending" "topic"
[[3]]
[1] "some" "very" "boring" "topic"
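The same greedy longest-match idea can be demonstrated without the hunspell dependency, using a small hardcoded word list in base R. The dictionary below is an illustrative assumption, not hunspell's dictionary:

```r
# Greedy left-to-right longest-match segmentation against a fixed
# dictionary (illustrative word list, standing in for hunspell).
dict <- c("some", "trending", "topic", "another", "very", "boring")

split_words_dict <- function(s) {
  split <- character(0)
  start <- 1
  while (start <= nchar(s)) {
    result <- NULL
    for (end in start:nchar(s)) {
      candidate <- substr(s, start, end)
      if (candidate %in% dict) result <- candidate  # keep the longest match
    }
    if (is.null(result)) return(character(0))       # no match: give up
    split <- c(split, result)
    start <- start + nchar(result)
  }
  split
}

split_words_dict("sometrendingtopic")
# [1] "some"     "trending" "topic"
```

Swapping in a fixed dictionary makes the behaviour reproducible, which is useful for seeing exactly where the greedy strategy goes wrong.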
Please keep in mind that this method is far from perfect in multiple ways. For example, with
input <- "#averyboringtopic"
the result will be
[[1]]
[1] "aver"   "y"      "boring" "topic"
since "aver" apparently is a possible word in this specific dictionary. So: use at your own risk and improve upon this!
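One direction for improvement, sketched here with the same kind of small illustrative dictionary (the word list is an assumption): instead of greedily committing to the longest prefix, use dynamic programming over all possible split points and prefer the segmentation with the fewest words. This recovers "a very boring topic" where the greedy approach dead-ends on "aver":

```r
# Dynamic-programming segmentation: best[[i + 1]] holds the best
# (fewest-words) segmentation of the first i characters, or NULL
# if no segmentation exists. Illustrative dictionary only.
dict <- c("a", "aver", "very", "boring", "topic")

segment <- function(s) {
  n <- nchar(s)
  best <- vector("list", n + 1)
  best[[1]] <- character(0)  # empty prefix segments trivially
  for (i in 1:n) {
    for (j in 1:i) {
      word <- substr(s, j, i)
      prev <- best[[j]]
      if (!is.null(prev) && word %in% dict) {
        cand <- c(prev, word)
        if (is.null(best[[i + 1]]) || length(cand) < length(best[[i + 1]]))
          best[[i + 1]] <- cand
      }
    }
  }
  best[[n + 1]]  # NULL if the string cannot be segmented
}

segment("averyboringtopic")
# [1] "a"      "very"   "boring" "topic"
```

Because the DP considers every split point, it is O(n^2) dictionary lookups, but for hashtag-length strings that is negligible; a real implementation would also want word frequencies to rank competing segmentations.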