Search code examples
rdata.tableuniqueaggregate

remove duplicate rows then aggregate into one row by group using datatable


I have duplicate rows in a data.table - see below. A couple of the variables are lists.

ID  l29 s14 s2  s7  s71 s91
12345   NULL    3   NULL    NULL    NULL    NULL
12345   NULL    NULL    1   NULL    NULL    NULL
12345   NULL    NULL    NULL    2   NULL    NULL
12345   NULL    NULL    NULL    NULL    c(4.5, 8, 9, 10)    NULL
12345   NULL    NULL    NULL    NULL    c(4.5, 8, 9, 10)    NULL
12345   NULL    NULL    NULL    NULL    c(4.5, 8, 9, 10)    NULL
12345   NULL    NULL    NULL    NULL    NULL    c(6, 7)
12345   11  NULL    NULL    NULL    NULL    NULL
12345   NULL    NULL    NULL    NULL    c(4.5, 8, 9, 10)    NULL
12345   NULL    NULL    NULL    NULL    NULL    c(6, 7)

I would like to summarize the above into one row as below:

12345   11  3   1   2   c(4.5, 8, 9, 10)    c(6, 7)

Someone asked me to dput the data, I hope this does it?

structure(list(SSN_TAX_ID = c("12345", "12345", "12345", "12345", 
"12345", "12345", "12345", "12345", "12345", "12345"), L29 = list(
    NULL, NULL, NULL, NULL, NULL, NULL, NULL, 11, NULL, NULL), 
    S14 = list(3, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 
        NULL), S2 = list(NULL, 1, NULL, NULL, NULL, NULL, NULL, 
        NULL, NULL, NULL), S7 = list(NULL, NULL, 2, NULL, NULL, 
        NULL, NULL, NULL, NULL, NULL), S71 = list(NULL, NULL, 
        NULL, c(4.5, 8, 9, 10), c(4.5, 8, 9, 10), c(4.5, 8, 9, 
        10), NULL, NULL, c(4.5, 8, 9, 10), NULL), S91 = list(
        NULL, NULL, NULL, NULL, NULL, NULL, c(6, 7), NULL, NULL, 
        c(6, 7))), .Names = c("SSN_TAX_ID", "L29", "S14", "S2", 
"S7", "S71", "S91"), class = "data.frame", row.names = c(NA, 
-10L))

Solution

  • This is pretty messy, but:

    setDT(DT) # since the OP posted a data frame
    DT[, lapply(.SD, function(x) unique(Filter(Negate(is.null), x)))]
    #    SSN_TAX_ID L29 S14 S2 S7                 S71 S91
    # 1:      12345  11   3  1  2  4.5, 8.0, 9.0,10.0 6,7
    

    Filter(Negate(is.null), x) drops the NULLs from each column before we uniquify.


    In the special case of the OP's example, using Reduce with union also happens to work:

    DT[, lapply(.SD, function(x) .(Reduce(union, x)))]