Search code examples
rcountdata.tableunique

How to cumulatively track unique combinations by group in R


I have observations with unique identifiers that are associated with one or more START/END date pair. These observations are iterated per ID by month within the date range. An example for one unique ID and category, truncated for length.

  ID       START        END    MONTH CAT.A
10056 2004-01-08 2005-01-07 Jan 2004 
10056 2004-01-08 2005-01-07 Feb 2004 
10056 2004-01-08 2005-01-07 Mar 2004 
...
10056 2004-01-08 2005-01-07 Nov 2004 
10056 2004-01-08 2005-01-07 Dec 2004 
10056 2004-01-08 2005-01-07 Jan 2005 
--------------------------------------
10056 2006-11-28 2008-02-20 Nov 2006 
10056 2006-11-28 2008-02-20 Dec 2006 
10056 2006-11-28 2008-02-20 Jan 2007 
...
10056 2006-11-28 2008-02-20 Dec 2007 
10056 2006-11-28 2008-02-20 Jan 2008 
10056 2006-11-28 2008-02-20 Feb 2008 
--------------------------------------
10056 2010-01-30 2011-02-03 Jan 2010 
10056 2010-01-30 2011-02-03 Feb 2010 
10056 2010-01-30 2011-02-03 Mar 2010 
...
10056 2010-01-30 2011-02-03 Dec 2010 
10056 2010-01-30 2011-02-03 Jan 2011 
10056 2010-01-30 2011-02-03 Feb 2011 

The solution I am looking for would cumulatively count each unique event occurrence of CAT.A. In the first date range CAT.A would be 1, increment CAT.A to 2 in the second date range, and 3 in the third. This counter would be unique to this ID and be NA otherwise

  ID       START        END    MONTH CAT.A
10056 2004-01-08 2005-01-07 Jan 2004 1
10056 2004-01-08 2005-01-07 Feb 2004 1
10056 2004-01-08 2005-01-07 Mar 2004 1
...
10056 2004-01-08 2005-01-07 Nov 2004 1
10056 2004-01-08 2005-01-07 Dec 2004 1
10056 2004-01-08 2005-01-07 Jan 2005 1
--------------------------------------
10056 2006-11-28 2008-02-20 Nov 2006 2
10056 2006-11-28 2008-02-20 Dec 2006 2
10056 2006-11-28 2008-02-20 Jan 2007 2
...
10056 2006-11-28 2008-02-20 Dec 2007 2
10056 2006-11-28 2008-02-20 Jan 2008 2
10056 2006-11-28 2008-02-20 Feb 2008 2
--------------------------------------
10056 2010-01-30 2011-02-03 Jan 2010 3
10056 2010-01-30 2011-02-03 Feb 2010 3
10056 2010-01-30 2011-02-03 Mar 2010 3
...
10056 2010-01-30 2011-02-03 Dec 2010 3
10056 2010-01-30 2011-02-03 Jan 2011 3
10056 2010-01-30 2011-02-03 Feb 2011 3

The dataset has millions of other unique IDs and 11 other categories, but if I can find a solution for this subset, I should be able to apply it to the entire dataset.

I've found solutions that will let me count the total number of unique combinations of ID,START,END, but nothing that will help increment A in each observation once only when it belongs to a new unique START,END event.

I have been using data.table and lubridate.


Solution

  • How about this?

    d = data.table(
        ID = c(rep(1,5), rep(2,5)),
        CAT = c(1,1,1,2,2,1,1,2,3,4)
        )
    
    d[, N_Unique := cumsum(!duplicated(CAT)), by = ID]
    
    > d
        ID CAT N_Unique
     1:  1   1        1
     2:  1   1        1
     3:  1   1        1
     4:  1   2        2
     5:  1   2        2
     6:  2   1        1
     7:  2   1        1
     8:  2   2        2
     9:  2   3        3
    10:  2   4        4
    

    If you then wanted a separate column for each value of ID (which seems quite odd, given that you have millions of unique values), you could use something like this:

    d[, ID := as.factor(ID)]
    > cbind(d, diag(d[,N_Unique]) %*% model.matrix(~ ID - 1, d))
        ID CAT N_Unique ID1 ID2
     1:  1   1        1   1   0
     2:  1   1        1   1   0
     3:  1   1        1   1   0
     4:  1   2        2   2   0
     5:  1   2        2   2   0
     6:  2   1        1   0   1
     7:  2   1        1   0   1
     8:  2   2        2   0   2
     9:  2   3        3   0   3
    10:  2   4        4   0   4