Search code examples
rreshape2dcast

How to split one column into different columns with dcast without aggregating?


I'm trying to reshape my data using dcast. I'm working with samples where each sample has 10-30 sample units. I can't have my data aggregate.

My data is in this format:

ID  total
sample_1    1
sample_1    0
sample_1    2
sample_1    1
sample_1    0
sample_1    0
sample_1    2
sample_1    1
sample_1    0
sample_1    2
sample_1    1
sample_1    4
sample_2    2
sample_2    1
sample_2    2
sample_2    0
sample_2    0
sample_2    0
sample_2    1
sample_2    2
sample_2    1
sample_2    4
sample_2    5
sample_2    2
sample_2    1
sample_3    0
sample_3    0
sample_3    1
sample_3    2
sample_3    1
sample_3    0
sample_3    2
sample_3    1
sample_3    4
sample_3    5
sample_3    1
sample_3    1
sample_3    0
sample_3    0
sample_3    1

And I want it to looks like it:

sample_1    sample_2    sample_3
1           2           0
0           1           0
2           2           1
1           0           2
0           0           1
0           0           0
2           1           2
1           2           1
0           1           4
2           4           5
1           5           1
4           2           1
            1           0
                        0
                        1

Where my sample ID's turn into different columns.

I tried in several ways but R keep aggregating it.


Solution

  • You can do this with dcast() but you have to add row numbers for each ID.

    The data.table package is another package besides reshape2 which implements dcast(). data.table has a handy rowid() function to generate unique row ids within each group. WIth that, we get:

    library(data.table)
    dcast(setDT(DF), rowid(ID) ~ ID, value.var = "total")
    #    ID sample_1 sample_2 sample_3
    # 1:  1        1        2        0
    # 2:  2        0        1        0
    # 3:  3        2        2        1
    # 4:  4        1        0        2
    # 5:  5        0        0        1
    # 6:  6        0        0        0
    # 7:  7        2        1        2
    # 8:  8        1        2        1
    # 9:  9        0        1        4
    #10: 10        2        4        5
    #11: 11        1        5        1
    #12: 12        4        2        1
    #13: 13       NA        1        0
    #14: 14       NA       NA        0
    #15: 15       NA       NA        1
    

    However, I recommend to continue any data processing in long format and use grouping. That's much easier than to work on individual columns. For instance,

    # count observations by group
    DF[, .N, by = ID]
    #         ID  N
    #1: sample_1 12
    #2: sample_2 13
    #3: sample_3 15
    
    # compute mean by group
    DF[, mean(total), by = ID]
    #         ID       V1
    #1: sample_1 1.166667
    #2: sample_2 1.615385
    #3: sample_3 1.266667
    
    # get min and max by group
    DF[, .(min = min(total), max = max(total)), by = ID]
    #         ID min max
    #1: sample_1   0   4
    #2: sample_2   0   5
    #3: sample_3   0   5
    
    # the same using range()
    DF[, as.list(range(total)), by = ID]
    #         ID V1 V2
    #1: sample_1  0  4
    #2: sample_2  0  5
    #3: sample_3  0  5
    

    Data

    DF <- structure(list(ID = c("sample_1", "sample_1", "sample_1", "sample_1", 
    "sample_1", "sample_1", "sample_1", "sample_1", "sample_1", "sample_1", 
    "sample_1", "sample_1", "sample_2", "sample_2", "sample_2", "sample_2", 
    "sample_2", "sample_2", "sample_2", "sample_2", "sample_2", "sample_2", 
    "sample_2", "sample_2", "sample_2", "sample_3", "sample_3", "sample_3", 
    "sample_3", "sample_3", "sample_3", "sample_3", "sample_3", "sample_3", 
    "sample_3", "sample_3", "sample_3", "sample_3", "sample_3", "sample_3"
    ), total = c(1L, 0L, 2L, 1L, 0L, 0L, 2L, 1L, 0L, 2L, 1L, 4L, 
    2L, 1L, 2L, 0L, 0L, 0L, 1L, 2L, 1L, 4L, 5L, 2L, 1L, 0L, 0L, 1L, 
    2L, 1L, 0L, 2L, 1L, 4L, 5L, 1L, 1L, 0L, 0L, 1L)), .Names = c("ID", 
    "total"), row.names = c(NA, -40L), class = "data.frame")