Search code examples
rcountuniqueaggregate

Add count of unique / distinct values by group to the original data


I wish to count the number of unique values by grouping of a second variable, and then add the count to the existing data.frame as a new column. For example, if the existing data frame looks like this:

  color  type
1 black chair
2 black chair
3 black  sofa
4 green  sofa
5 green  sofa
6   red  sofa
7   red plate
8  blue  sofa
9  blue plate
10 blue chair

I want to add for each color, the count of unique types that are present in the data:

  color  type unique_types
1 black chair            2
2 black chair            2
3 black  sofa            2
4 green  sofa            1
5 green  sofa            1
6   red  sofa            2
7   red plate            2
8  blue  sofa            3
9  blue plate            3
10 blue chair            3

I was hoping to use ave, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so am also not sure how important efficiency is.

It's somewhat similar to this issue: Count number of observations/rows per group and add result to data frame


Solution

  • Using ave (since you ask for it specifically):

    within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})
    

    Make sure that type is character vector and not factor.


    Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.

    require(data.table)
    setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
    # if you don't want df to be modified by reference
    ans = as.data.table(df)[, count := uniqueN(type), by = color]
    

    uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). In addition it also works with data.frames/data.tables.


    Other solutions:

    Using plyr:

    require(plyr)
    ddply(df, .(color), mutate, count = length(unique(type)))
    

    Using aggregate:

    agg <- aggregate(data=df, type ~ color, function(x) length(unique(x)))
    merge(df, agg, by="color", all=TRUE)