
How do I repeat a block of R code with the names changing on each repetition?


I'm working with several text outputs from QIIME that I want to process to produce boxplots. Every input is formatted the same way, so the processing is always identical; only the source name changes. For each input, I want to extract the last 5 rows, compute the mean of each column (sample), associate the values with the experimental sample labels (Group) taken from the mapfile, and arrange them in the order I use for a boxplot of all 6 datasets.

In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to run a command several times, with ${i} substituting the name automatically on each iteration.

I'm struggling to find a way to do the same in R. I thought of creating a vector containing the names and then using a for loop that steps through it with [1], [2], etc., but it doesn't work: it stops at the read.delim line because it cannot find the file in the working directory.

Here's the manipulation code I wrote. The block after the #GG97 comment is repeated 6 times, once for each of the 6 databases I'm using (GG97 GG100 SILVA97 SILVA100 NCBI RDP).

On top of that, I repeat this whole process 4 times because I use 4 metrics (shannon is shown here, but I also have copies of the code for chao1, observed_species and PD_whole_tree).

library(tidyverse)
library(labelled)

mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")

#GG97
 collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
  collated <- tail(collated,5); collated <- collated[,-c(1:3)]
  collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]

  labels <- t(mapfile)
  colnames(collated_reorder) <- labels[2,]

  mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
  mean = as.matrix(mean); mean <- t(mean)

  GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
  GG97_shannon <- t(GG97_shannon); 

  DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
  GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
  colnames(GG97_shannon) <- c("DB","Group","value")
  rm(collated,collated_reorder,DB_type,labels,mean)

Here I paste all the outputs together, freeze the order and make the boxplot.

alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
  rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)

    alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
    alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))

library(ggplot2)
ggplot(data = alpha_shannon) +
  aes(x = DB, y = value, colour = Group) +
  geom_boxplot()+
  labs(title = 'Shannon',
       x = 'Database',
       y = 'Diversity') +
  # theme() goes last: a complete theme such as theme_grey() resets earlier theme() settings
  theme_grey(base_size = 16) +
  theme(legend.position = 'bottom')

How do I keep this code "DRY" so that I don't need 146 lines of code repeating the same thing over and over? Thank you!!


Solution

  • You didn't provide a minimal reproducible example, so I could not test this answer and cannot guarantee that it is correct.

    An important point is that you call rm(...) at the end of each block, which means those variables are only relevant within that block. Therefore, wrap the block in a function: its variables become local to the function, the code becomes reusable, and you no longer need to remove variables by hand:

    process <- function(file, DB){
      # -> Use the function parameter `file` instead of a hardcoded filename
      collated <- read.delim(file=file, check.names=FALSE);  
      collated <- tail(collated,5); collated <- collated[,-c(1:3)]
      collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
    
      labels <- t(mapfile)
      colnames(collated_reorder) <- labels[2,]
    
      mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
      mean = as.matrix(mean); mean <- t(mean)
    
      # -> rename this variable to a more general name, e.g. `result`
      result <- as.data.frame(rbind(labels[2,],mean))
      result <- t(result); 
    
      # -> Use the function parameter `DB` instead of a hardcoded string, and
      #    nrow(result) (one row per sample) instead of the hardcoded 41
      DB_type <- list(DB = DB); DB_type <- rep(DB_type, nrow(result))
      result <- as.data.frame(cbind(DB_type,result))
      colnames(result) <- c("DB","Group","value")
    
      # -> After the end of this function, the variables defined in this function
      #    vanish automatically, you just need to specify the result
      return(result)
    }
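
As a side note, the same steps can be written more compactly. The following is only a sketch under assumptions, not a tested replacement for your data: process_tidy is a hypothetical name, it takes mapfile as an argument instead of reading it from the global environment, and it builds the data frame directly, which avoids the rbind/t round trip that turns the means into character strings. The toy mapfile and input file are made up so that the snippet runs on its own.

```r
# Hypothetical compact variant of process(); not the original function
process_tidy <- function(file, DB, mapfile) {
  collated <- read.delim(file, check.names = FALSE)
  # Keep the last 5 rows and drop the first 3 bookkeeping columns
  collated <- tail(collated, 5)[, -(1:3), drop = FALSE]
  # Put the sample columns in the same order as the mapfile rows
  collated <- collated[, match(mapfile$SampleID, colnames(collated)), drop = FALSE]
  data.frame(DB = DB,                          # recycled to one value per sample
             Group = mapfile$Pathology_group,
             value = unname(colMeans(collated)),
             stringsAsFactors = FALSE)
}

# Made-up inputs, only so the example is runnable
mapfile <- data.frame(SampleID = c("S1", "S2", "S3"),
                      Pathology_group = c("HC", "BV", "HC"),
                      stringsAsFactors = FALSE)
toy <- data.frame(file = "x", depth = 10, iteration = 0:5,
                  S2 = c(9, 2, 2, 2, 2, 2),
                  S1 = c(9, 1, 1, 1, 1, 1),
                  S3 = c(9, 3, 3, 3, 3, 3))
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

res <- process_tidy(tmp, "GG97", mapfile)
print(res)
```

Passing mapfile explicitly is a design choice: the function then depends only on its arguments, which makes it easier to test on small inputs like the one above.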
    

    Now you can reuse process() for each dataset:

    GG97_shannon      <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
    GG100_shannon     <- process(file =...., DB = ....)
    SILVA97_shannon   <- ...
    SILVA100_shannon  <- ...
    NCBI_shannon      <- ...
    RDP_shannon       <- ...
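
The remaining calls follow the same pattern, so the names can be generated rather than typed, which is the R counterpart of ${i} in the bash loop. A minimal sketch, assuming every input file is named alpha_diversity/<DB>_<metric>.txt like the one GG97 path shown in the question (the variable names dbs, metric, files and datasets are only illustrative):

```r
# Database names as listed in the question
dbs    <- c("GG97", "GG100", "SILVA97", "SILVA100", "NCBI", "RDP")
metric <- "shannon"

# paste0() and file.path() are vectorised, so one call builds all six names.
# ASSUMPTION: all files follow the naming pattern of the GG97 example.
files    <- file.path("alpha_diversity", paste0(dbs, "_", metric, ".txt"))
datasets <- paste0(dbs, "_", metric)

print(files)
print(datasets)
```

If the original loop failed at read.delim, one possible cause is that the file name was never assembled into a full string like this, though that cannot be confirmed without seeing the loop.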
    

    Alternatively, you can use looping structures:

    • General-purpose for:

      datasets <-  c("GG97_shannon", "GG100_shannon", "SILVA97_shannon", 
                     "SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
      files    <-  c("alpha_diversity/GG97_shannon.txt", .....)
      DBs      <-  c("GG97", ....)
      result   <-  list()
      
      for(i in seq_along(datasets)){
         result[[datasets[i]]] <- process(files[i], DBs[i])
      }
      
    • mapply, a "specialized for" for looping over several vectors in parallel:

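      # the first argument is the function from above, the following ones are passed
      # element by element as arguments to our process(.) function;
      # SIMPLIFY = FALSE keeps the result a list of data frames instead of
      # letting mapply try to collapse it into a matrix
      results <- mapply(process, files, DBs, SIMPLIFY = FALSE)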
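
Whichever loop you choose, the resulting list still has to be stacked into one data frame for ggplot, and the whole thing can be repeated per metric. The sketch below shows one way to do that. It is self-contained only because process_stub is a hypothetical stand-in that returns made-up values in the (DB, Group, value) shape; with real data you would call process() instead. The file naming pattern and the helper name collect_metric are assumptions as well.

```r
# Hypothetical stand-in for process(): made-up values, same column layout
process_stub <- function(file, DB) {
  data.frame(DB = DB, Group = c("HC", "BV"), value = c(1, 2),
             stringsAsFactors = FALSE)
}

dbs     <- c("GG97", "GG100", "SILVA97", "SILVA100", "NCBI", "RDP")
metrics <- c("shannon", "chao1", "observed_species", "PD_whole_tree")

# Illustrative helper: process all databases for one metric
collect_metric <- function(metric) {
  # ASSUMPTION: files are named alpha_diversity/<DB>_<metric>.txt
  files <- file.path("alpha_diversity", paste0(dbs, "_", metric, ".txt"))
  # Map() is mapply() with SIMPLIFY = FALSE, so it always returns a list
  results <- Map(process_stub, files, dbs)
  # Stack the per-database data frames, replacing the manual rbind(...)
  combined <- do.call(rbind, results)
  rownames(combined) <- NULL
  # Freeze the plotting order to the order of first appearance
  combined$DB    <- factor(combined$DB, levels = unique(combined$DB))
  combined$Group <- factor(combined$Group, levels = unique(combined$Group))
  combined
}

# One combined data frame per metric, in a named list
alpha <- lapply(setNames(metrics, metrics), collect_metric)
str(alpha$shannon)
```

Each element, for example alpha$shannon, can then be passed to the ggplot call from the question in place of alpha_shannon.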