Search code examples
rdataframedplyrr-factorsummarize

How to count factors frequency and organize in a new dataframe in R


I have a dataframe like this:

test1 = data.frame("id" = c("FC01", "FC01", "FC22", "FC03", "FC01"),
                    "product" = c("p01", "p02", "p03", "p01", "p03"),
                    "year" = c("2018", "2017", "2015", "2018", "2016"))

I need to find the IDs that appear more than onc, bought between 2016 and 2018, and know which products they bought and which year. Is it possible to create a new dataframe that showing the ids and how many times they appear and when did this happen? Something like this:

test2 = data.frame("times" = c(3, 1), "id" = c("FC01", "FC03"),
                   "year" = c("2018, 2017, 2016", "2018"))

I used dplyr and tried to group by id and filter every id that appears more than once, but I don't know how to continue to get something like this test2. I appreciate any tips in this regard.


Solution

  • test1$year <- as.numeric(as.character(test1$year))
    
    test1 %>% filter(between(year,2016,2018))
          %>% group_by(id)
          %>% summarize(times = n(),
              year = toString(unique(year)))
    
    
      id    times year          
      <fct> <int> <chr>         
    1 FC01      3 2018 2017 2016
    2 FC03      1 2018  
    

    Notes:

    • Getting the times column is easy, we just use the utility function dplyr::n().
    • For the pasted list of (unique) string names of years, same approach as this answer. toString(...) is cleaner code than paste0(as.character(...), collapse=' ')
    • Note we must use unique(year) as you might have multiple entries for same year.
    • In order to be able to filter(between(year, 2016, 2018)), we must first fix up year to be numeric, not a factor (or at minimum, make sure the factor levels are also 2015..2018 so that directly doing as.numeric() works as intended, instead of giving 1..4