Search code examples
rdataframereshapereshape2tidyr

Tidy data.frame with repeated column names


I have a program that gives me data in this format

toy
                file_path Condition Trial.Num A B  C  ID A B  C   ID  A B  C    ID
1     root/some.extension  Baseline         1 2 3  5 car 2 1  7 bike  4 9  0 plane
2    root/thing.extension  Baseline         2 3 6 45 car 5 4  4 bike  9 5  4 plane
3     root/else.extension  Baseline         3 4 4  6 car 7 5  4 bike 68 7 56 plane
4 root/uniquely.extension Treatment         1 5 3  7 car 1 7 37 bike  9 8  7 plane
5  root/defined.extension Treatment         2 6 7  3 car 4 6  8 bike  9 0  8 plane

My goal is to tidy the format into something that at least can be easier to finally tidy with reshape having unique column names

tidy_toy
                 file_path Condition Trial.Num  A B  C    ID
1      root/some.extension  Baseline         1  2 3  5   car
2     root/thing.extension  Baseline         2  3 6 45   car
3      root/else.extension  Baseline         3  4 4  6   car
4  root/uniquely.extension Treatment         1  5 3  7   car
5   root/defined.extension Treatment         2  6 7  3   car
6      root/some.extension  Baseline         1  2 1  7  bike
7     root/thing.extension  Baseline         2  5 4  4  bike
8      root/else.extension  Baseline         3  7 5  4  bike
9  root/uniquely.extension Treatment         1  1 7 37  bike
10  root/defined.extension Treatment         2  4 6  8  bike
11     root/some.extension  Baseline         1  4 9  0 plane
12    root/thing.extension  Baseline         2  9 5  4 plane
13     root/else.extension  Baseline         3 68 7 56 plane
14 root/uniquely.extension Treatment         1  9 8  7 plane
15  root/defined.extension Treatment         2  9 0  8 plane

If I try to melt from toy it doesn't work because only the first ID column will get used for id.vars (hence everything will get tagged as cars). Identical variables will get dropped.

Here's the dput of both tables

   structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
    "root/else.extension", "root/some.extension", "root/thing.extension", 
    "root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
    1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"), 
        Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L, 
        4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L, 
        1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L, 
        5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L, 
        4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"), 
        A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L, 
        4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path", 
    "Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C", 
    "ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
    -5L))


structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L, 
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
"root/else.extension", "root/some.extension", "root/thing.extension", 
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline", 
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L, 
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L, 
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L, 
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L, 
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike", 
"car", "plane"), class = "factor")), .Names = c("file_path", 
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
-15L))

Solution

  • You can use the make.unique-function to create unique column names. After that you can use melt from the data.table-package which is able to create multiple value-columns based on patterns in the columnnames:

    # make the column names unique
    names(toy) <- make.unique(names(toy))
    # let the 'Condition' column start with a small letter 'c'
    # so it won't be detected by the patterns argument from melt
    names(toy)[2] <- tolower(names(toy)[2])
    
    # load the 'data.table' package
    library(data.table)
    # tidy the data into long format
    tidy_toy <- melt(setDT(toy), 
                     measure.vars = patterns('^A','^B','^C','^ID'), 
                     value.name = c('A','B','C','ID'))
    

    which gives:

     > tidy_toy
                      file_path condition Trial.Num variable  A B  C    ID
     1:     root/some.extension  Baseline         1        1  2 3  5   car
     2:    root/thing.extension  Baseline         2        1  3 6 45   car
     3:     root/else.extension  Baseline         3        1  4 4  6   car
     4: root/uniquely.extension Treatment         1        1  5 3  7   car
     5:  root/defined.extension Treatment         2        1  6 7  3   car
     6:     root/some.extension  Baseline         1        2  2 1  7  bike
     7:    root/thing.extension  Baseline         2        2  5 4  4  bike
     8:     root/else.extension  Baseline         3        2  7 5  4  bike
     9: root/uniquely.extension Treatment         1        2  1 7 37  bike
    10:  root/defined.extension Treatment         2        2  4 6  8  bike
    11:     root/some.extension  Baseline         1        3  4 9  0 plane
    12:    root/thing.extension  Baseline         2        3  9 5  4 plane
    13:     root/else.extension  Baseline         3        3 68 7 56 plane
    14: root/uniquely.extension Treatment         1        3  9 8  7 plane
    15:  root/defined.extension Treatment         2        3  9 0  8 plane
    

    Another option is to use a list of column-indexes for measure.vars:

    tidy_toy <- melt(setDT(toy), 
                     measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)), 
                     value.name = c('A','B','C','ID'))
    

    Making the column-names unique isn't necessary then.


    A more complicated method that creates names that are better distinguishable by the patterns argument:

    # select the names that are not unique
    tt <- table(names(toy))
    idx <- which(names(toy) %in% names(tt)[tt > 1])
    nms <- names(toy)[idx]
    
    # make them unique
    names(toy)[idx] <- paste(nms, 
                             rep(seq(length(nms) / length(names(tt)[tt > 1])), 
                                 each = length(names(tt)[tt > 1])), 
                             sep = '.')
    
    # your columnnames are now unique:
    > names(toy)
     [1] "file_path" "Condition" "Trial.Num" "A.1"       "B.1"       "C.1"       "ID.1"      "A.2"      
     [9] "B.2"       "C.2"       "ID.2"      "A.3"       "B.3"       "C.3"       "ID.3"     
    
    # tidy the data into long format
    tidy_toy <- melt(setDT(toy), 
                     measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'), 
                     value.name = c('A','B','C','ID'))
    

    which will give the same end-result.


    As mentioned in the comments, the janitor-package can be helpful for this problem as well. The clean_names() works similar as the make.unique function. See here for an explanation.