Search code examples
rdata.tableaggregateones-complement

R, Complement on aggregating data.table


Is it possible to aggregate with a complement in R data.tables. Example below.

library(data.table)
dt <- data.table(a=c("word1","word2","word2","word2"), b=c("cat1","cat1","cat1","cat2"))

To get number of particular words in a category

newdt <- dt[,(.N),by=.(a,b)]
#word1,cat1 - 1
#word2,cat1 - 2
#word2,cat2 - 1

How could I count the number of all other words in the category? Or relatedly, number of other categories that the word is in? Something like the following?

#doesn't work
#newdt2 <- dt[a!=a,(.N),by=.(a,b)]
#the expected answer would be
#word1,cat1 - 2
#word2,cat1 - 1
#word2,cat2 - 0

I can't find any help on this in online tutorials or questions. Is there an easy way to get the complement. Data.table solution would be nice, as working with a 50M row table. Thanks!


Solution

  • Following the idea of Bruno to compute the difference of total count per category minus the count of words in each category but using data.table syntax with an update on join this becomes a "one-liner":

    library(data.table)
    dt <-data.table(a = c("word1", rep("word2", 3L)), b = c(rep("cat1", 3L), "cat2"))
    dt[, .N, by = .(a, b)][dt[, .N, by = b], on = "b", Nc := i.N - N][]
    
           a    b N Nc
    1: word1 cat1 1  2
    2: word2 cat1 2  1
    3: word2 cat2 1  0