r  text-mining  market-basket-analysis

Best way to clean free text then turn into a transaction dataset


I have survey data containing free-text answers that I would like to clean and then convert into a transaction dataset for use with the arules R package. Right now the text looks like this:

id | Answers    
1  | John thinks that the product is not worth the price
2  | Amy believes that the functionality is well above expectations 

Here's what I'm trying to do:

1 | John | thinks   | Product       | Not   | Worth | Price    
2 | Amy  | Believes | Functionality | Above | Expectations

So far I have been able to clean the data using the tm package: I've converted everything to lowercase and removed the stop words. What I don't know is the best way to convert the cleaned corpus into a transaction dataset.

Let's say my data is in a data frame called "Questions".


Solution

  • You can try:

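    library(stringr)
    # 'data' is the data frame with the Answers column ("Questions" in the question)
    # split each answer into words on single spaces
    str_split(data$Answers, " ")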
    

    The output is a list with one character vector of words per answer:

    [[1]]
     [1] "John"    "thinks"  "that"    "the"     "product" "is"      "not"     "worth"   "the"     "price"  
    
    [[2]]
    [1] "Amy"           "believes"      "that"          "the"           "functionality" "is"           
    [7] "well"          "above"         "expectations" 
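
    The split above keeps the original casing and the stop words, while the question asks for lowercase tokens with stop words removed. A minimal sketch of that step, assuming a data frame `data` rebuilt from the question's two rows and a deliberately tiny, hand-written stop word list (`stop_words` is a hypothetical name; a fuller list such as `tm::stopwords("en")` could be substituted):

    ```r
    library(stringr)

    # Example data rebuilt from the question (assumed structure)
    data <- data.frame(
      id = 1:2,
      Answers = c("John thinks that the product is not worth the price",
                  "Amy believes that the functionality is well above expectations"),
      stringsAsFactors = FALSE
    )

    # Hypothetical stop word list, chosen only to reproduce the desired output
    stop_words <- c("that", "the", "is", "well")

    # Lowercase each answer, then split it on runs of whitespace
    tokens <- str_split(str_to_lower(data$Answers), "\\s+")

    # Drop stop words; setdiff() also removes duplicated words
    tokens <- lapply(tokens, function(x) setdiff(x, stop_words))
    tokens
    ```

    With this stop word list, each answer is reduced to the words shown in the question's desired layout, in lowercase.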
    

    Edit:

    Remove words that are repeated within an answer using the unique function:

    my_list <- str_split(data$Answers, " ")
    lapply(my_list , unique)
    
    [[1]]
    [1] "John"    "thinks"  "that"    "the"     "product" "is"      "not"     "worth"   "price"  
    
    [[2]]
    [1] "Amy"           "believes"      "that"          "the"           "functionality" "is"           
    [7] "well"          "above"         "expectations"
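
    To finish the conversion the question asks about, the de-duplicated list can be coerced to an arules transactions object. A minimal sketch, assuming the arules package is installed; the list is typed in by hand here to mirror the output above, and the names "1" and "2" are assumed to be the survey ids:

    ```r
    library(arules)

    # De-duplicated word lists, as produced by the unique() step
    my_list <- list(
      c("John", "thinks", "that", "the", "product", "is", "not", "worth", "price"),
      c("Amy", "believes", "that", "the", "functionality", "is", "well", "above",
        "expectations")
    )

    # List names are used as transaction ids (assumed here to be the survey ids)
    names(my_list) <- c("1", "2")

    # Coerce the list of character vectors to a transactions object
    trans <- as(my_list, "transactions")

    # Show each transaction with its items
    inspect(trans)
    ```

    `trans` can then be passed to arules functions such as `apriori()`. In practice you would probably remove the stop words first, so that words like "the" do not dominate the resulting rules.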