Search code examples
rdataframeggplot2tidyversedata-wrangling

Using R, how to regroup multiple dataframe columns into a smaller number of new columns


I am carrying out some exercises from a very good online R/bioinformatics course. To this end I am wrangling with data in the form of a 'SummarizedExperiment' object from a Bioconductor package of the same name. The rows consist of gene names and gene expression values; the columns consist of 9 ctrl (control) samples, 9 'drug1' treated samples and 9 'drug5' treated samples. Here is what the table looks like: dataframe of 3 genes and their expression under various conditions - by way of example only control (CTRL) conditions are shown The task is to regroup data in this dataframe so that CTRL0_1 - CTRL0_9 are placed in a single column, named 'CTRL0'. In the same fashion, new 'DRUG1' and 'DRUG5' named columns are needed consisting of gene expression for each gene in the columns DRUG1_1 - DRUG1_9 and DRUG5_1 - DRUG5_9, respectively. Data are derived from the final question on this webpage: https://uclouvain-cbio.github.io/WSBIM1207/sec-bioinfo.html The task is to generate a ggplot like this:Correct distribution of the gene expression in each group Instead, with my inelegant code I get this:MY plot of distribution in each group (ignore the different x-axis labels, they amount to the same as in the graph above) To generate MY plot, I used this code:

table_gene_gp_expr <- geneCol_filter_assaySeDf %>% 
  pivot_longer(cols = CTRL0_1:CTRL0_9,
               names_to = "ctrl0",
               values_to = "ctrl0_expr") %>% 
  pivot_longer(cols = DRUG1_1:DRUG1_9,
               names_to = "drug1",
               values_to = "drug1_expr") %>% 
  pivot_longer(cols = DRUG5_1:DRUG5_9,
               names_to = "drug5",
               values_to = "drug5_expr") %>% 
  pivot_longer(cols = c(ctrl0_expr, drug1_expr, drug5_expr),
               names_to = "group",
               values_to = "gene_expression") %>% 
  select(gene, group, gene_expression)

I have used consecutive pivot_longer() commands, but I wonder if there is a more efficient way to generate the table below that I used for MY ggplot. Note the data in MY gggplot and the correct ggplot are different, which I can't explain. The head() of the table I used looks like this:dataframe I used to generate MY plot I can post the code I wrote to generate the 'wide' dataframe of the 1st table shown if it's really needed.


Solution

  • Given sample data like this:

    tibble(gene = c("gene1", "gene2"),
       CTRL0_1 = 1:2,
       CTRL0_2 = 3:4,
       DRUG1_1 = 5:6,
       DRUG1_2 = 7:8,
       DRUG5_1 = 9:10) -> test
    

    Here are two variations of syntax for pivoting all your data columns and then splitting them up by group. You'd need a different approach if the data in those columns had different types.

    test %>%
      pivot_longer(cols = -gene) %>%
      separate(name, c("group", "obs"), convert = TRUE) %>%
      pivot_wider(names_from = group, values_from = value)
    
    test %>%
      pivot_longer(cols = -gene, names_to = c("group", "obs"), names_sep = "_") %>%
      pivot_wider(names_from = group, values_from = value)