I have survey data containing free text that I would like to clean and then convert into a transaction dataset for use with the arules R package. Right now the text looks like this:
id | Answers
1 | John thinks that the product is not worth the price
2 | Amy believes that the functionality is well above expectations
Here's what I'm trying to do:
1 | John | thinks | Product | Not | Worth | Price
2 | Amy | Believes | Functionality | Above | Expectations
So far I have been able to clean the data using the tm package: I've converted the text to lowercase and removed the stop words. What I don't know is the best way to convert the result to a transaction dataset.
Let's say my data is in a data frame called "Questions". After cleaning, I am unable to convert the corpus into a transaction dataset.
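You can try splitting each answer on spaces with stringr (here data stands for your data frame, "Questions" in your case):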
library(stringr)
str_split(data$Answers, " ")
The output is a list:
[[1]]
[1] "John" "thinks" "that" "the" "product" "is" "not" "worth" "the" "price"
[[2]]
[1] "Amy" "believes" "that" "the" "functionality" "is"
[7] "well" "above" "expectations"
Remove duplicates within each answer using the unique function:
my_list <- str_split(data$Answers, " ")
lapply(my_list , unique)
[[1]]
[1] "John" "thinks" "that" "the" "product" "is" "not" "worth" "price"
[[2]]
[1] "Amy" "believes" "that" "the" "functionality" "is"
[7] "well" "above" "expectations"
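To get from a list like this to something arules can work with, one option is to coerce the list to the transactions class. Below is a minimal sketch, not a definitive solution. It assumes the data frame is called Questions as in the question; the two-row data frame only reproduces the sample above, and the stop_words vector is a hypothetical placeholder for whatever stop-word list you already apply with tm.

```r
library(arules)

# Stand-in for the "Questions" data frame described in the question
Questions <- data.frame(
  id = c(1, 2),
  Answers = c("John thinks that the product is not worth the price",
              "Amy believes that the functionality is well above expectations"),
  stringsAsFactors = FALSE
)

# Hypothetical stop-word list; replace with the one you actually use
stop_words <- c("that", "the", "is", "well")

# Lowercase, split on whitespace, drop stop words, and keep each word once
# per answer (arules expects no duplicated items within a transaction)
words <- strsplit(tolower(Questions$Answers), "\\s+")
items <- lapply(words, function(w) unique(w[!w %in% stop_words]))

# List names become the transaction IDs
names(items) <- Questions$id

# Coerce the list of character vectors to a transactions object
trans <- as(items, "transactions")
inspect(trans)
```

Each answer becomes one transaction and each remaining word becomes an item, so trans can be passed straight to functions such as apriori().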