Search code examples
rreshape2strsplit

Melt dataset splitting columns by delimiter in r reshape2


I have a dataset of abundance of organisms across sites processed by 3 different people that I would like to take from wide to long format using column headers as information to split into different ID variables. here is a snippet of the dataset:

> head(reads[,1:7])
        TAXON ORRU_CL.rod ORRU_CL.louis ORRU_CL.john2 BONE_CL.rod BONE_CL.louis
1  ACENTRELLA           4             2             0           0             0
2  ACRONEURIA           0             0             0           0             0
3  ALLOCAPNIA          30            33            44         223           162
4   ALLOPERLA           0             0             0           0             0
5 AMPHINEMURA         912           659           805        6667          4905
6   ANOPHELES           0             0             0           0             0
  BONE_CL.john2
1             0
2             0
3           206
4             0
5          6343
6             0

I want to keep TAXON as an id variable, but also have a "SITE" column which is represented by the variable before the "." (ORRU_CL or BONE_CL) and then a "USER" variable which is the name of the person processing (rod, louis or john2). I know that I can split by the "." but not sure how to incorporate this into the reshape2 melt call.

With a simple melt call we get the long version with TAXON as the id variable, but Id like to make an additional two columns of "SITE" and "USER", splitting at the '.'

> long.reads<-melt(head(reads[,1:7])) 
> long.reads
         TAXON      variable value
1   ACENTRELLA   ORRU_CL.rod     4
2   ACRONEURIA   ORRU_CL.rod     0
3   ALLOCAPNIA   ORRU_CL.rod    30
4    ALLOPERLA   ORRU_CL.rod     0
5  AMPHINEMURA   ORRU_CL.rod   912
6    ANOPHELES   ORRU_CL.rod     0
7   ACENTRELLA ORRU_CL.louis     2
8   ACRONEURIA ORRU_CL.louis     0
9   ALLOCAPNIA ORRU_CL.louis    33
10   ALLOPERLA ORRU_CL.louis     0
11 AMPHINEMURA ORRU_CL.louis   659
12   ANOPHELES ORRU_CL.louis     0
13  ACENTRELLA ORRU_CL.dada2     0
14  ACRONEURIA ORRU_CL.dada2     0
15  ALLOCAPNIA ORRU_CL.dada2    44
16   ALLOPERLA ORRU_CL.dada2     0
17 AMPHINEMURA ORRU_CL.dada2   805
18   ANOPHELES ORRU_CL.dada2     0
19  ACENTRELLA   BONE_CL.rod     0
20  ACRONEURIA   BONE_CL.rod     0
21  ALLOCAPNIA   BONE_CL.rod   223
22   ALLOPERLA   BONE_CL.rod     0
23 AMPHINEMURA   BONE_CL.rod  6667
24   ANOPHELES   BONE_CL.rod     0
25  ACENTRELLA BONE_CL.louis     0
26  ACRONEURIA BONE_CL.louis     0
27  ALLOCAPNIA BONE_CL.louis   162
28   ALLOPERLA BONE_CL.louis     0
29 AMPHINEMURA BONE_CL.louis  4905
30   ANOPHELES BONE_CL.louis     0
31  ACENTRELLA BONE_CL.dada2     0
32  ACRONEURIA BONE_CL.dada2     0
33  ALLOCAPNIA BONE_CL.dada2   206
34   ALLOPERLA BONE_CL.dada2     0
35 AMPHINEMURA BONE_CL.dada2  6343
36   ANOPHELES BONE_CL.dada2     0

Any suggestions?

Thanks for your help!

LP


Solution

  • Using tidyr::pivot_longer you could split the column names on the . like:

    library(dplyr)
    library(tidyr)
    
    reads %>%
      pivot_longer(-TAXON, 
                   names_to = c('SITE','USER'),
                   names_sep = '\\.')
    #> # A tibble: 36 × 4
    #>    TAXON      SITE    USER  value
    #>    <chr>      <chr>   <chr> <int>
    #>  1 ACENTRELLA ORRU_CL rod       4
    #>  2 ACENTRELLA ORRU_CL louis     2
    #>  3 ACENTRELLA ORRU_CL john2     0
    #>  4 ACENTRELLA BONE_CL rod       0
    #>  5 ACENTRELLA BONE_CL louis     0
    #>  6 ACENTRELLA BONE_CL john2     0
    #>  7 ACRONEURIA ORRU_CL rod       0
    #>  8 ACRONEURIA ORRU_CL louis     0
    #>  9 ACRONEURIA ORRU_CL john2     0
    #> 10 ACRONEURIA BONE_CL rod       0
    #> # ℹ 26 more rows
    

    Data

    structure(list(TAXON = c("ACENTRELLA", "ACRONEURIA", "ALLOCAPNIA", 
    "ALLOPERLA", "AMPHINEMURA", "ANOPHELES"), ORRU_CL.rod = c(4L, 
    0L, 30L, 0L, 912L, 0L), ORRU_CL.louis = c(2L, 0L, 33L, 0L, 659L, 
    0L), ORRU_CL.john2 = c(0L, 0L, 44L, 0L, 805L, 0L), BONE_CL.rod = c(0L, 
    0L, 223L, 0L, 6667L, 0L), BONE_CL.louis = c(0L, 0L, 162L, 0L, 
    4905L, 0L), BONE_CL.john2 = c(0L, 0L, 206L, 0L, 6343L, 0L)), class = "data.frame", row.names = c(NA, 
    -6L))