Tags: r, split, stemming, text-analysis

How to split a text into two meaningful words in R


This is the text in my data frame df, which has a text column called 'problem_note_text':

SSCIssue: Note Dispenser Failureperformed checks / dispensor failure / asked the stores to take the note dispensor out and set it back / still error message says front door is open / hence CE attn reqContact details - Olivia taber 01159063390 / 7am-11pm

library(stringr)  # str_replace_all()
library(tm)       # removeNumbers(), removeWords(), stopwords()
library(qdap)     # all_words() (assumed to come from qdap)

df$problem_note_text <- tolower(df$problem_note_text)
df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
df$problem_note_text <- str_replace_all(df$problem_note_text, "  ", " ") # replace double spaces with single space
df$problem_note_text <- str_replace_all(df$problem_note_text, pattern = "[[:punct:]]", " ")
df$problem_note_text <- tm::removeWords(x = df$problem_note_text, stopwords(kind = 'english'))
Words <- all_words(df$problem_note_text, begins.with = NULL)

Now I have a data frame containing a list of words, but it includes fused words like

"Failureperformed"

which need to be split into two meaningful words:

"Failure" "performed"

How do I do this? The words data frame also contains tokens like

"im", "h"

which do not make sense and have to be removed. I do not know how to achieve this.


Solution

  • Given a list of English words, you can do this pretty simply by trying every possible two-way split of the word and checking whether both halves appear in the list. I'll use the first Google hit I found for my word list, which contains about 70k lower-case words:

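    wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1
    
    check.word <- function(x, wl) {
      x <- tolower(x)
      nc <- nchar(x)
      if (nc < 2) return(character(0))  # a single character cannot be split
      # each column is one candidate split: prefix in row 1, suffix in row 2
      parts <- sapply(1:(nc-1), function(y) c(substr(x, 1, y), substr(x, y+1, nc)))
      # keep only the splits where both halves are in the word list
      parts[,parts[1,] %in% wl & parts[2,] %in% wl]
    }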
    

    This sometimes works:

    check.word("screenunable", wl)
    # [1] "screen" "unable"
    check.word("nowhere", wl)
    #      [,1]    [,2]  
    # [1,] "no"    "now" 
    # [2,] "where" "here"
    

    But it also sometimes fails when the relevant words aren't in the word list (in this case "sensor" was missing):

    check.word("sensoradvise", wl)
    #     
    # [1,]
    # [2,]
    "sensor" %in% wl
    # [1] FALSE
    "advise" %in% wl
    # [1] TRUE
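
One way to work around the missing-word problem, and to drop tokens like "im" and "h" at the same time, is sketched below. This is only an illustration under stated assumptions: `wl_demo` is a tiny stand-in for the real word list, and `split_candidates()` and `clean_tokens()` are hypothetical helper names, not part of any package. The idea is to append missing domain terms to the list, keep tokens that are already valid words, split a token only when exactly one split is found, and drop everything else.

```r
# Tiny stand-in word list (an assumption for this sketch); use the full `wl` in practice.
wl_demo <- c("failure", "performed", "screen", "unable", "advise",
             "no", "now", "where", "here", "nowhere")

# Domain terms missing from the list can simply be appended.
wl_demo <- union(wl_demo, "sensor")

# Return every two-way split of `x` whose halves are both in `wl`,
# as a list of length-2 character vectors (empty list if there is none).
split_candidates <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  if (nc < 2) return(list())
  cuts  <- seq_len(nc - 1)
  left  <- substring(x, 1, cuts)        # all prefixes
  right <- substring(x, cuts + 1, nc)   # the matching suffixes
  ok <- which(left %in% wl & right %in% wl)
  lapply(ok, function(i) c(left[i], right[i]))
}

# Keep known words as they are, split tokens that have exactly one
# candidate split, and drop everything else (including ambiguous splits).
clean_tokens <- function(tokens, wl) {
  out <- lapply(tolower(tokens), function(tok) {
    if (tok %in% wl) return(tok)
    cand <- split_candidates(tok, wl)
    if (length(cand) == 1) cand[[1]] else character(0)
  })
  as.character(unlist(out))
}

clean_tokens(c("Failureperformed", "sensoradvise", "nowhere", "im", "h"), wl_demo)
# returns, in order: failure, performed, sensor, advise, nowhere
```

Checking the whole token first means a valid word such as "nowhere" is not split even though two splits exist. Whether "im" or "h" are dropped depends on the word list: a list that contains single letters or abbreviations would keep them, so a minimum-length filter may still be needed. Dropping unknown tokens also discards misspellings such as "dispensor", which may or may not be what you want.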