Search code examples
rstringdataframerep

Replicate column names and splice


I have a dataframe as follows:

df <- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))

#   A B
# 1 2 0
# 2 0 3
# 3 1 2

The number in each cell indicates the times for which the corresponding column name should repeat. The replicates should be spliced by semicolons(;) to a single string. The expected output turns out to

#      A     B
# 1  A;A  <NA>
# 2 <NA> B;B;B
# 3    A   B;B

I'm searching a efficient way to deal with a much larger dataset:

set.seed(1234)
df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
names(df) <- LETTERS

#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
# 2  1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
# 3  5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
# ...
# [ reached 'max' / getOption("max.print") -- omitted 9997 rows ]

I prefer base or tidyverse solutions. data.table is welcome but I'm unfamiliar to it though.


Solution

  • One base option:

    res <- df
    n <- unlist(res)
    res[res > 0] <- sapply(
      split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
      paste, collapse = ';'
    )
    res[res == 0] <- NA
    
    res[1:5, 1:5]
    #           A         B       C         D       E
    # 1     A;A;A      <NA>   C;C;C D;D;D;D;D E;E;E;E
    # 2         A B;B;B;B;B       C      <NA>   E;E;E
    # 3 A;A;A;A;A B;B;B;B;B     C;C      <NA>       E
    # 4   A;A;A;A   B;B;B;B    <NA>         D E;E;E;E
    # 5     A;A;A         B C;C;C;C       D;D   E;E;E