I have a dataframe like this:
test1 = data.frame("id" = c("FC01", "FC01", "FC22", "FC03", "FC01"),
"product" = c("p01", "p02", "p03", "p01", "p03"),
"year" = c("2018", "2017", "2015", "2018", "2016"))
I need to find the IDs that appear more than onc, bought between 2016 and 2018, and know which products they bought and which year. Is it possible to create a new dataframe that showing the ids and how many times they appear and when did this happen? Something like this:
test2 = data.frame("times" = c(3, 1), "id" = c("FC01", "FC03"),
"year" = c("2018, 2017, 2016", "2018"))
I used dplyr
and tried to group by id and filter every id that appears more than once, but I don't know how to continue to get something like this test2. I appreciate any tips in this regard.
test1$year <- as.numeric(as.character(test1$year))
test1 %>% filter(between(year,2016,2018))
%>% group_by(id)
%>% summarize(times = n(),
year = toString(unique(year)))
id times year
<fct> <int> <chr>
1 FC01 3 2018 2017 2016
2 FC03 1 2018
Notes:
times
column is easy, we just use the utility function dplyr::n()
.toString(...)
is cleaner code than paste0(as.character(...), collapse=' ')
unique(year)
as you might have multiple entries for same year.filter(between(year, 2016, 2018))
, we must first fix up year
to be numeric, not a factor (or at minimum, make sure the factor levels are also 2015..2018 so that directly doing as.numeric()
works as intended, instead of giving 1..4