I have a dataset with multiple columns, where each row represents a product and each comment column holds one comment on that product. Each product therefore has several comments, each stored in its own column.
Now I want to create two new datasets: (1) a dataset with a single column containing a random sample of x comments drawn from all of the comment columns; (2) the same as (1), but sampling the same number of comments from each column (e.g., 2 comments from "comment1" and 2 comments from "comment2").
Example data:
commentda = data.frame(product_id = c(1,2,3,4), comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"), comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again"))
> commentda
  product_id     comment1            comment2
1          1    Very good      Bad reputation
2          2          Bad         Good seller
3          3 Would buy it       Great service
4          4   Zero stars I will buy it again
You can reshape the data into long format, which makes both sampling operations straightforward.
library(dplyr)
n <- 2
long_data <- commentda %>% tidyr::pivot_longer(cols = starts_with('comment'))
Random n comments from all columns:
long_data %>% slice_sample(n = n)
Random n comments from each column:
long_data %>% group_by(name) %>% slice_sample(n = n)
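Since the question asks for datasets with only one column, here is a minimal sketch of how the two samples could be reduced to a single comment column. It assumes the default `name` and `value` columns that `pivot_longer()` produces; the object names `sample_any` and `sample_each`, the output column name `comment`, and the seed value are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

commentda <- data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again")
)

n <- 2
set.seed(42)  # arbitrary seed (an assumption), only so the draw is reproducible

# One row per (product, comment column) pair; columns: product_id, name, value
long_data <- commentda %>%
  pivot_longer(cols = starts_with("comment"), names_to = "name", values_to = "value")

# (1) n comments drawn from all comment columns pooled together
sample_any <- long_data %>%
  slice_sample(n = n) %>%
  select(comment = value)

# (2) n comments drawn from each comment column
sample_each <- long_data %>%
  group_by(name) %>%
  slice_sample(n = n) %>%
  ungroup() %>%               # drop grouping so select() keeps only one column
  select(comment = value)
```

With the example data, `sample_any` should have n rows and `sample_each` should have n rows per comment column, each with a single `comment` column. The `ungroup()` call matters because `select()` on a grouped data frame keeps the grouping column.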