I have this output that contains the values within a family of genes (e.g. family01, family21) in distinct 'places' (bel..tej), but it also contains some metadata associated with these sites (type of Environment). I know the data is not as tidy as I wish, so I wonder if there is an elegant way to use pivot_longer
to tidy this in a long format to make a heatmap later.
Here is the dataset.
Sample bel buc mal man pen poc tej
Environment C_3 C_1 C_3 C_3 C_2 C_1 C_3
family01 1.962 20.790 0.000 0.000 0.000 0.000 1.962
family03 0.000 3.150 0.000 0.000 152.614 0.089 0.000
family08 4.482 12.603 0.168 0.460 0.000 2.917 4.482
family13 0.000 1.697 0.000 0.000 169.841 0.000 0.000
family17 0.462 10.689 0.000 0.000 0.000 1.387 0.462
family21 0.000 0.410 0.000 0.000 122.959 0.000 0.000
I searched for help with the function pivot_longer
and several questions referred to multiple columns not rows and not about 'simultaneously' applying the function to two set of variables. First I tried this long <- pivot_longer(data = output.txt, cols = -c(Family), names_to = "site", values_to = "rpkm")
but I obtained this result
# A tibble: 49 x 3
Sample site rpkm
<chr> <chr> <chr>
1 Environment bel C_3
2 Environment buc C_1
3 Environment mal C_3
4 Environment man C_3
5 Environment pen C_2
6 Environment poc C_1
7 Environment tej C_3
8 family1 bel 1.962
9 family1 buc 20.790
10 family1 mal 0.000
This is the expected format
Family site Env rpkm
family1 bel 3 1.962
family1 buc 1 20.790
family1 mal 3 0.000
I suppose that if the first row wasn't there I could do
output.txt %>%
pivot_longer(
-Environment,
names_to = c(".value", "Env"),
names_sep = "_")
In order to get a column 'Env' with the type of environment stored as a variable and then I would need 'another' conversion to long format.
I imagine two successive pivot_longer rounds could fix this or using other strategy with melt
(e.g. on site and type of site as first 2 rows). This is a long table so I wanted to avoid sorting or manually editing it.
This is the output of dput:
structure(list(Sample = c("Environment", "family01", "family03",
"family08", "family13", "family17", "family21"), bel = c("C__3",
"1.962", "0", "4.482", "0", "0.462", "0"), buc = c("C_1", "20.79",
"3.15", "12.603", "1.697", "10.689", "0.41"), mal = c("C_3",
"0", "0", "0.168", "0", "0", "0"), man = c("C_3", "0", "0", "0.46",
"0", "0", "0"), pen = c("C_2", "0", "152.614", "0", "169.841",
"0", "122.959"), poc = c("C_1", "0", "0.089", "2.917", "0", "1.387",
"0"), tej = c("C_3", "1.962", "0", "4.482", "0", "0.462", "0"
)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))
addendum in case anyone wants to use the solution I had to use a workaround saving (xlsx or csv) and loading again the tibble and specifying that the values (rpkm) are numeric.
Here is one option where we change the column names except the first by paste
ing (str_c
) the first row element after removing the C_
, then remove the first row (slice(-1)
), convert the type
of the columns with type.convert
, reshape from 'wide' to 'long' (pivot_longer
), separate
the 'name' column into two by splitting at the boundary between a lower case and a digit ((?<=[a-z])(?=\\d)
- regex lookaround)
library(dplyr)
library(tidyr)
library(stringr)
output.txt %>%
rename_at(-1, ~ str_c(., unlist(output.txt %>%
slice(1) %>%
select(-Sample) %>%
unlist %>%
str_remove('C_+')))) %>%
slice(-1) %>%
type.convert(as.is = TRUE) %>%
pivot_longer(cols = -Sample, values_to = 'rpkm') %>%
separate(name, into = c('site', 'Env'), sep='(?<=[a-z])(?=\\d)')
# A tibble: 42 x 4
# Sample site Env rpkm
# <chr> <chr> <chr> <dbl>
# 1 family01 bel 3 1.96
# 2 family01 buc 1 20.8
# 3 family01 mal 3 0
# 4 family01 man 3 0
# 5 family01 pen 2 0
# 6 family01 poc 1 0
# 7 family01 tej 3 1.96
# 8 family03 bel 3 0
# 9 family03 buc 1 3.15
#10 family03 mal 3 0
# … with 32 more rows