I have data in the following format:
dataset <- data.frame(taxa = c("k__Archaea| p__Crenarchaeota", "k__Archaea| p__Euryarchaeota", "k__Bacteria| p__[Thermi]"),
"11908.MM.0008.Inf.6m.Stool" =c(0,1760,0),
"11908.MM.01115.Inf.6m.Stool" =c(0,1517,0),
"11908.MM.0044.Inf.6m.Stool" =c(0,10815,0),
"11908.MM.0125.Mom.6m.Stool" = c(0,4719,0))
view(dataset)
And I would like to convert it to the following format:
fix_dataset <- data.frame(study_id = c(0008, 0115, 0044, 0125),
individual = c("Inf", "Inf", "Inf", "Mom" ),
`k__Archaea| p__Crenarchaeota` = c(0,0,0,0),
`k__Archaea| p__Euryarchaeota`= c(1760, 1517,10815, 4719),
`` = c(0,0,0,0),
timept1 = c("6m", "6m", "6m", "6m"))
view(fix_dataset)
I am trying to cut out the beginning number series 11908 and "Stool" from each column name, split out the other parts of the column names and convert from wide to long format.
I'm using the following code
library(tidyverse)
dataset %>%
pivot_longer(cols = -taxa) %>%
separate(col = name, into = c("info1", "info2", "study_id", "individual", "timept1", "info3"), sep = "[.]") %>%
pivot_wider(names_from = taxa,
values_from = value) %>%
select(study_id, individual, starts_with("k_"), timept1)
And I'm getting the following error message when I apply this to my data
Error in select(., study_id, individual, timept1, starts_with("k_")) :
unused arguments (study_id, individual, timept1, starts_with("k_"))
In addition: Warning messages:
1: Expected 6 pieces. Additional pieces discarded in 44 rows [242, 243, 903, 904, 1564, 1565, 2225, 2226, 2886, 2887, 3547, 3548, 4208, 4209, 4869, 4870, 5530, 5531, 6191, 6192, ...].
2: Expected 6 pieces. Missing pieces filled with `NA` in 1012 rows [74, 93, 94, 223, 224, 225, 226, 227, 228, 229, 230, 469, 470, 532, 533, 535, 536, 540, 580, 593, ...].
3: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
Does anyone have suggestions for these error messages?
You can achieve this with the following code:
library(tidyverse)
dataset %>%
pivot_longer(cols = -taxa) %>%
separate(col = name, into = c("info1", "info2", "study_id", "individual", "timept1", "info3"), sep = "[.]") %>%
pivot_wider(names_from = taxa,
values_from = value) %>%
select(study_id, individual, starts_with("taxa"), timept1)
which gives:
# A tibble: 4 x 6
study_id individual taxa1 taxa2 taxa3 timept1
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0008 Inf 0 1760 0 6m
2 01115 Inf 0 1517 0 6m
3 0044 Inf 0 10815 0 6m
4 0125 Mom 0 4719 0 6m
Note that there is some inconsistency in your study id, i.e. in your original data set one of the ids is "01115" and in your preferred output it is "0115".