Search code examples
rdataframedplyrtidyrmelt

Using melt or pivot_longer over 2 rows with variables


I have this output that contains the values within a family of genes (e.g. family01, family21) in distinct 'places' (bel..tej), but it also contains some metadata associated with these sites (type of Environment). I know the data is not as tidy as I wish, so I wonder if there is an elegant way to use pivot_longer to tidy this in a long format to make a heatmap later.

Here is the dataset.

Sample  bel buc mal man pen poc tej
Environment C_3 C_1 C_3 C_3 C_2 C_1 C_3
family01    1.962   20.790  0.000   0.000   0.000   0.000   1.962
family03    0.000   3.150   0.000   0.000   152.614 0.089   0.000
family08    4.482   12.603  0.168   0.460   0.000   2.917   4.482
family13    0.000   1.697   0.000   0.000   169.841 0.000   0.000
family17    0.462   10.689  0.000   0.000   0.000   1.387   0.462
family21    0.000   0.410   0.000   0.000   122.959 0.000   0.000

I searched for help with the function pivot_longer and several questions referred to multiple columns not rows and not about 'simultaneously' applying the function to two set of variables. First I tried this long <- pivot_longer(data = output.txt, cols = -c(Family), names_to = "site", values_to = "rpkm") but I obtained this result

# A tibble: 49 x 3
   Sample      site  rpkm  
   <chr>       <chr> <chr> 
 1 Environment bel   C_3    
 2 Environment buc   C_1    
 3 Environment mal   C_3    
 4 Environment man   C_3    
 5 Environment pen   C_2    
 6 Environment poc   C_1    
 7 Environment tej   C_3    
 8 family1     bel   1.962 
 9 family1     buc   20.790
10 family1     mal   0.000 

This is the expected format

  Family  site  Env  rpkm
  family1   bel  3   1.962      
  family1   buc  1   20.790 
  family1   mal  3   0.000  

I suppose that if the first row wasn't there I could do

output.txt %>%
  pivot_longer(
    -Environment,
    names_to = c(".value", "Env"),
    names_sep = "_")

In order to get a column 'Env' with the type of environment stored as a variable and then I would need 'another' conversion to long format.

I imagine two successive pivot_longer rounds could fix this or using other strategy with melt (e.g. on site and type of site as first 2 rows). This is a long table so I wanted to avoid sorting or manually editing it.

This is the output of dput:

structure(list(Sample = c("Environment", "family01", "family03", 
"family08", "family13", "family17", "family21"), bel = c("C__3", 
"1.962", "0", "4.482", "0", "0.462", "0"), buc = c("C_1", "20.79", 
"3.15", "12.603", "1.697", "10.689", "0.41"), mal = c("C_3", 
"0", "0", "0.168", "0", "0", "0"), man = c("C_3", "0", "0", "0.46", 
"0", "0", "0"), pen = c("C_2", "0", "152.614", "0", "169.841", 
"0", "122.959"), poc = c("C_1", "0", "0.089", "2.917", "0", "1.387", 
"0"), tej = c("C_3", "1.962", "0", "4.482", "0", "0.462", "0"
)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))

addendum in case anyone wants to use the solution I had to use a workaround saving (xlsx or csv) and loading again the tibble and specifying that the values (rpkm) are numeric.


Solution

  • Here is one option where we change the column names except the first by pasteing (str_c) the first row element after removing the C_, then remove the first row (slice(-1)), convert the type of the columns with type.convert, reshape from 'wide' to 'long' (pivot_longer), separate the 'name' column into two by splitting at the boundary between a lower case and a digit ((?<=[a-z])(?=\\d) - regex lookaround)

    library(dplyr)
    library(tidyr)
    library(stringr)
    output.txt %>% 
        rename_at(-1, ~ str_c(., unlist(output.txt %>% 
                                           slice(1) %>% 
                                           select(-Sample) %>% 
                                           unlist %>% 
                                           str_remove('C_+')))) %>% 
        slice(-1) %>%
        type.convert(as.is = TRUE) %>%
        pivot_longer(cols = -Sample, values_to = 'rpkm') %>% 
        separate(name, into = c('site', 'Env'), sep='(?<=[a-z])(?=\\d)')
    # A tibble: 42 x 4
    #   Sample   site  Env    rpkm
    #   <chr>    <chr> <chr> <dbl>
    # 1 family01 bel   3      1.96
    # 2 family01 buc   1     20.8 
    # 3 family01 mal   3      0   
    # 4 family01 man   3      0   
    # 5 family01 pen   2      0   
    # 6 family01 poc   1      0   
    # 7 family01 tej   3      1.96
    # 8 family03 bel   3      0   
    # 9 family03 buc   1      3.15
    #10 family03 mal   3      0   
    # … with 32 more rows