Search code examples
rreshapetransposestrsplitcsv

How to split column names and drop parts of the names and convert data from wide to long format in R


I have data in the following format:

dataset <- data.frame(taxa = c("k__Archaea| p__Crenarchaeota", "k__Archaea| p__Euryarchaeota", "k__Bacteria| p__[Thermi]"),
                      "11908.MM.0008.Inf.6m.Stool" =c(0,1760,0),
                      "11908.MM.01115.Inf.6m.Stool" =c(0,1517,0),
                      "11908.MM.0044.Inf.6m.Stool" =c(0,10815,0),
                      "11908.MM.0125.Mom.6m.Stool" = c(0,4719,0))
view(dataset)

And I would like to convert it to the following format:

fix_dataset <- data.frame(study_id = c(0008, 0115, 0044, 0125),
individual = c("Inf", "Inf", "Inf", "Mom" ),
`k__Archaea| p__Crenarchaeota` = c(0,0,0,0),
 `k__Archaea| p__Euryarchaeota`= c(1760, 1517,10815, 4719),
`` = c(0,0,0,0),
timept1 = c("6m", "6m", "6m", "6m"))

view(fix_dataset)

I am trying to cut out the beginning number series 11908 and "Stool" from each column name, split out the other parts of the column names and convert from wide to long format.

I'm using the following code

library(tidyverse)
dataset %>%
  pivot_longer(cols = -taxa) %>%
  separate(col = name, into = c("info1", "info2", "study_id", "individual", "timept1", "info3"), sep = "[.]") %>%
  pivot_wider(names_from = taxa,
              values_from = value) %>%
  select(study_id, individual, starts_with("k_"), timept1)

And I'm getting the following error message when I apply this to my data

Error in select(., study_id, individual, timept1, starts_with("k_")) : 
  unused arguments (study_id, individual, timept1, starts_with("k_"))
In addition: Warning messages:
1: Expected 6 pieces. Additional pieces discarded in 44 rows [242, 243, 903, 904, 1564, 1565, 2225, 2226, 2886, 2887, 3547, 3548, 4208, 4209, 4869, 4870, 5530, 5531, 6191, 6192, ...]. 
2: Expected 6 pieces. Missing pieces filled with `NA` in 1012 rows [74, 93, 94, 223, 224, 225, 226, 227, 228, 229, 230, 469, 470, 532, 533, 535, 536, 540, 580, 593, ...]. 
3: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 

Does anyone have suggestions for these error messages?


Solution

  • You can achieve this with the following code:

    library(tidyverse)
    dataset %>%
      pivot_longer(cols = -taxa) %>%
      separate(col = name, into = c("info1", "info2", "study_id", "individual", "timept1", "info3"), sep = "[.]") %>%
      pivot_wider(names_from = taxa,
                  values_from = value) %>%
      select(study_id, individual, starts_with("taxa"), timept1)
    

    which gives:

    # A tibble: 4 x 6
      study_id individual taxa1 taxa2 taxa3 timept1
      <chr>    <chr>      <dbl> <dbl> <dbl> <chr>  
    1 0008     Inf            0  1760     0 6m     
    2 01115    Inf            0  1517     0 6m     
    3 0044     Inf            0 10815     0 6m     
    4 0125     Mom            0  4719     0 6m 
    

    Note that there is some inconsistency in your study id, i.e. in your original data set one of the ids is "01115" and in your preferred output it is "0115".