r, hashtag

Word segmentation for hashtags using R


I would like to do word segmentation on hashtags, that is, split a hashtag into its component words. This is my attempt, but it didn't work.

What I am trying to do:

  1. INPUT: #sometrendingtopic
  2. OUTPUT: some trending topic

My attempt:

    library(tokenizers)  # assumed source of the tokenize_* functions

    s <- "#sometrendingtopic"
    tokenize_character_shingles(s)
    tokenize_words(s)
    tokenize_characters(s)

I found some information, but it is for Python (https://stackoverflow.com/.../r-split-string-by-symbol). Thanks in advance for any ideas and guidance.


Solution

  • This is an absolutely non-trivial task, and I think it cannot be solved in general. Since there is no delimiter between the words, you basically need to extract substrings and check them against a dictionary of the desired language. A very crude method, which walks from left to right and keeps only the longest match it can find at each position, uses hunspell. That package is designed for spell checking but can be "misused" to attempt this task:

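    split_words <- function(cat.string){
      split <- NULL
      start.char <- 1
      # "<=" so that a trailing one-letter word is not silently dropped
      while(start.char <= nchar(cat.string))
      {
        result <- NULL
        # Try every substring starting at start.char; the last (longest)
        # one that passes the spell check wins
        for(cur.char in start.char:nchar(cat.string))
        {
          test.string <- substr(cat.string,start.char,cur.char)
          # hunspell() returns the misspelled words, so length 0 means "valid word"
          test <- hunspell::hunspell(test.string)[[1]]
          if(length(test) == 0) result <- test.string
        }
        # No valid word starts at this position: give up on this hashtag
        if(is.null(result)) return("")
        split <- c(split,result)
        start.char <- start.char + nchar(result)
      }
      split
    }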
    
    
    input <- c("#sometrendingtopic","#anothertrendingtopic","#someveryboringtopic")
    
    # Strip the leading "#" from each hashtag
    input <- sub("#","",input)
    # Apply the word split to each hashtag
    result <- lapply(input,split_words)
    result
    [[1]]
    [1] "some"     "trending" "topic"   
    
    [[2]]
    [1] "another"  "trending" "topic"   
    
    [[3]]
    [1] "some"   "very"   "boring" "topic" 
    

    Please keep in mind that this method is far from perfect in several ways:

    1. It is relatively slow.
    2. It matches greedily from left to right. For example, with input <- "#averyboringtopic" the result will be
    [[3]]
    [1] "aver"   "y"      "boring" "topic" 
    

    This happens because "aver" is apparently a valid word in this specific dictionary. So use at your own risk and improve upon this!
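One possible way to soften the greedy-matching problem is to score whole segmentations instead of taking the longest word at each position. The following is only a minimal sketch under stated assumptions, not a replacement for the hunspell approach: `segment_hashtag` and `word_freq` are hypothetical names, and the counts in `word_freq` are invented for illustration (a real table would come from a corpus). It uses dynamic programming to pick the split whose words have the lowest total cost, where rarer words cost more.

```r
# Hypothetical word counts, invented for this example only.
word_freq <- c(a = 1000, very = 300, boring = 50, topic = 80, some = 400,
               trending = 20, another = 150, aver = 1, y = 2)

segment_hashtag <- function(hashtag, word_freq,
                            max_len = max(nchar(names(word_freq)))) {
  text <- tolower(sub("^#", "", hashtag))
  n <- nchar(text)
  if (n == 0) return(character(0))

  # Cost of a word: negative log of its relative frequency (rare = expensive)
  cost <- -log(word_freq / sum(word_freq))

  best <- c(0, rep(Inf, n))  # best[i + 1]: cheapest cost of the first i characters
  prev <- integer(n + 1)     # prev[i + 1]: start position of the last word used

  for (end in seq_len(n)) {
    for (start in max(1, end - max_len + 1):end) {
      word <- substr(text, start, end)
      if (!word %in% names(cost)) next
      candidate <- best[start] + cost[[word]]
      if (candidate < best[end + 1]) {
        best[end + 1] <- candidate
        prev[end + 1] <- start
      }
    }
  }

  # No combination of known words covers the whole string
  if (is.infinite(best[n + 1])) return(character(0))

  # Walk backwards through prev to recover the words
  words <- character(0)
  end <- n
  while (end > 0) {
    start <- prev[end + 1]
    words <- c(substr(text, start, end), words)
    end <- start - 1
  }
  words
}

segment_hashtag("#averyboringtopic", word_freq)
# [1] "a"      "very"   "boring" "topic"
```

With these made-up counts, "aver" and "y" are both rare, so "a very" is cheaper than "aver y" even though both splits use the same number of words. The quality of the result depends entirely on the frequency table, and any hashtag containing a word missing from the table returns an empty vector.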