Search code examples
rarraysvectordata.tablestring-aggregation

Aggregate character string into vector in R


I have a data table test:

id key
1 2365
1 2365
1 3709
2 6734
2 1908
2 4523

I want to aggregate unique key values by id into vector using data.table package.

Expected output:

id key_array
1 "2365", "3709"
2 "6734", "1908", "4523"

So, this should work like array_agg sql function.

I tried:
res <- test[, list(key_array = paste(unique(key), collapse = ", ")), by = "id"], but I get just a string. But I need to have opportunity to find the length of each vector and operate with its certain elements (find the intersection of two vectors for example).


Solution

  • 1. Base R

    This an aggregate one-liner.

    x <- 'id    key
    1   2365
    1   2365
    1   3709
    2   6734
    2   1908
    2   4523'
    test <- read.table(textConnection(x), header = TRUE)
    
    aggregate(key ~ id, test, \(x) c(unique(x)))
    #>   id              key
    #> 1  1       2365, 3709
    #> 2  2 6734, 1908, 4523
    

    Created on 2022-06-14 by the reprex package (v2.0.1)

    But if user @Chris's comment is right then the right solution as follows.

    aggregate(key ~ id, test, \(x) paste(unique(x), collapse = ", "))
    

    Note that both c(unique(x)) and as.character(c(unique(x))) will output a list column, so the latter solution is right anyway.


    2. Package data.table

    Once again a one-liner.

    The output is a list column, with each list member an integer vector. To keep as integers use

    list(unique(key))
    

    instead.

    suppressPackageStartupMessages(library(data.table))
    
    res <- setDT(test)[, .(key_array = list(as.character(unique(key)))), by = id]
    res
    #>    id      key_array
    #> 1:  1      2365,3709
    #> 2:  2 6734,1908,4523
    
    str(res)
    #> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
    #>  $ id       : int  1 2
    #>  $ key_array:List of 2
    #>   ..$ : chr  "2365" "3709"
    #>   ..$ : chr  "6734" "1908" "4523"
    #>  - attr(*, ".internal.selfref")=<externalptr>
    

    Created on 2022-06-14 by the reprex package (v2.0.1)

    Then, in order to access the vectors use two extractors, one to extract the column and the other one to extract the vectors.

    res$key_array[[1]]
    #> [1] "2365" "3709"
    res$key_array[[2]]
    #> [1] "6734" "1908" "4523"
    

    Created on 2022-06-14 by the reprex package (v2.0.1)


    3. dplyr solution

    Group by id and collapse the unique strings into one only.

    suppressPackageStartupMessages(library(dplyr))
    
    test %>%
      group_by(id) %>%
      summarise(key_array = paste(unique(key), collapse = ", "))
    #> # A tibble: 2 × 2
    #>      id key_array  
    #>   <int> <chr>           
    #> 1     1 2365, 3709      
    #> 2     2 6734, 1908, 4523
    

    Created on 2022-06-14 by the reprex package (v2.0.1)