I have a dataset that looks something like this:
name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)
name year
1 john 2010
2 john 2011
3 john 2012
4 alex 2011
5 alex 2012
6 tim 2010
7 tim 2011
8 tim 2012
9 ralph 2014
10 ralph 2016
I want to count the two following things in this dataset:
As an example for 1):
# sample output for 1)
year count
1 2010, 2011, 2012 2
2 2011, 2012 1
3 2014, 2016 1
And as an example of 2) - only row 3 (in the above data frame) contains a missing year (i.e. 2014 to 2016 without 2015). Thus, the output would look something like this:
# sample output for 2)
year count
1 2014, 2016 1
Can someone please show me how to do this in R? And is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?
EDIT: For anyone using an older version of R, @Rui Barradas provided an answer for 2) - I have included it here so that there is no ambiguity when copy/pasting:
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, function(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
Here are base R solutions.
# 1.
agg <- aggregate(year ~ name, my_data, paste, collapse = ", ")
as.data.frame(table(agg$year))
#> Var1 Freq
#> 1 2010, 2011, 2012 2
#> 2 2011, 2012 1
#> 3 2014, 2016 1
# 2.
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
#> Var1 Freq
#> 1 2014, 2016 1
# final clean up
rm(agg)
Created on 2022-12-03 with reprex v2.0.2
Answering to the comment/request,
Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?
a way is to, in each group of name
, first sort the data by year. Then run the code above.
my_data <- my_data[order(my_data$name, my_data$year), ]