Tags: r, function, text, dplyr, sample

Random sample from multiple columns


I have a dataset in which each row represents a product. For each product we observe multiple comments, each stored in its own column.

Now I want to create two new datasets: (1) a dataset with only one column, containing a random sample of x comments drawn from all the comment columns together; (2) the same as (1), but with the same number of comments sampled from each column (e.g., 2 comments from "comment1" and 2 comments from "comment2").

Example data:
> commentda = data.frame(product_id = c(1,2,3,4), comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"), comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again"))
> commentda
  product_id     comment1            comment2
1          1    Very good      Bad reputation
2          2          Bad         Good seller
3          3 Would buy it       Great service
4          4   Zero stars I will buy it again

Solution

  • Reshape the data to long format first. With all comments stacked in a single column, both kinds of sampling reduce to one slice_sample() call.

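    library(dplyr)
    n <- 2  # number of comments to sample
    
    # Stack all comment columns into name/value pairs (one row per comment)
    long_data <- commentda %>% tidyr::pivot_longer(cols = starts_with('comment'))
    
    # (1) Sample n comments at random across all comment columns
    long_data %>% slice_sample(n = n)
    
    # (2) Sample n comments at random from each comment column
    long_data %>% group_by(name) %>% slice_sample(n = n)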
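
The question asks for datasets with only one column, whereas the calls above also return product_id and the source column name. A minimal end-to-end sketch of one way to get there, using the example data from the question, might look like the following. The column names source_column and comment, the object names sample_pooled and sample_per_column, and the fixed seed are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question
commentda <- data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again")
)

set.seed(42)  # assumption: fixed seed only so the draw is reproducible
n <- 2        # number of comments to sample

# Long format: one row per (product, comment column) pair.
# "source_column" and "comment" are hypothetical names chosen for readability.
long_data <- commentda %>%
  pivot_longer(cols = starts_with("comment"),
               names_to = "source_column",
               values_to = "comment")

# (1) n comments drawn from all comment columns pooled together,
#     reduced to a single column
sample_pooled <- long_data %>%
  slice_sample(n = n) %>%
  select(comment)

# (2) n comments from each original column, reduced to a single column
sample_per_column <- long_data %>%
  group_by(source_column) %>%
  slice_sample(n = n) %>%
  ungroup() %>%
  select(comment)
```

With two comment columns and n = 2, sample_pooled should hold 2 rows and sample_per_column 4 rows. slice_sample() draws without replacement by default; pass replace = TRUE if the same comment may be drawn more than once.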