Search code examples
rdata-manipulation

Counting Number of Unique Column Values Per Group


I have a dataset that looks something like this:

name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)

    name year
1   john 2010
2   john 2011
3   john 2012
4   alex 2011
5   alex 2012
6    tim 2010
7    tim 2011
8    tim 2012
9  ralph 2014
10 ralph 2016

I want to count the two following things in this dataset:

    1. Groups based on all years
    1. And of these groups, the number of groups with at least one non-consecutive year

As an example for 1):

# sample output for 1)

              year count
1 2010, 2011, 2012     2
2       2011, 2012     1
3       2014, 2016     1

And as an example of 2) - only row 3 (in the above data frame) contains a missing year (i.e. 2014 to 2016 without 2015). Thus, the output would look something like this:

# sample output for 2)

              year count
1       2014, 2016     1

Can someone please show me how to do this in R? And is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ? 

EDIT: For anyone using an older version of R, @Rui Barradas provided an answer for 2) - I have included it here so that there is no ambiguity when copy/pasting:

agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, function(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))

Solution

  • Here are base R solutions.

    # 1.
    agg <- aggregate(year ~ name, my_data, paste, collapse = ", ")
    as.data.frame(table(agg$year))
    #>               Var1 Freq
    #> 1 2010, 2011, 2012    2
    #> 2       2011, 2012    1
    #> 3       2014, 2016    1
    
    # 2.
    agg <- aggregate(year ~ name, my_data, c)
    agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
    as.data.frame(table(sapply(agg, paste, collapse = ", ")))
    #>         Var1 Freq
    #> 1 2014, 2016    1
    
    # final clean up
    rm(agg)  
    

    Created on 2022-12-03 with reprex v2.0.2


    Edit

    Answering to the comment/request,

    Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?

    a way is to, in each group of name, first sort the data by year. Then run the code above.

    my_data <- my_data[order(my_data$name, my_data$year), ]