Search code examples
rcsvmultiple-columnspaste

R script to batch all .tsv files in a directory to have a new column with information from other columns


I would like to combine a few steps into an R script to do the following:

  1. load one .tsv file after the other (there are hundreds of them in a directory)
  2. fuse 3 specific columns in these files to make a new column "Fusion"
  3. output these files into the old .tsv file (so I don't get hundreds of new files)

The following steps work, but they are very clumsy I'm afraid (I'm really bad at coding) and they are not batched and have to be put in one after the other.

test <- read.table(
   "1.tsv",
   sep="\t", header=TRUE)

test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)

write.table(test, file = "1.tsv", append = FALSE, quote = TRUE, sep = "\t",
                 eol = "\n", na = "NA", dec = ".", row.names = TRUE,
                 col.names = TRUE, qmethod = c("escape", "double"),
                 fileEncoding = "")

As you can see, the files have to be put in by hand one at a time and the data frame "test" also seems redundant (?).

It would be great if someone could put these together in a script, which simply uses the working directory of R and goes through the files one by one, adds a new "Fusion" column, writes the new .tsv file and moves on.

Thanks already for your help!


Solution

  • Here is what I would do to loop over your code for each file in the pwd using your approach. Make sure that you run this script in the directory with your target .tsv files.

    #!/usr/bin/Rscript
    
    print(getwd()) ## print the pwd to the standard output to ensure that you are in the
                   ## right directory
    files<-list.files(".",pattern="*.tsv") ## List all files in the pwd that end in .tsv
    cols2fuse<-c("amino_acid","v_gene","j_gene") ## Parametrized the columns to fuse
    prefix<-"fused-" ## Include this so that you don't overwrite your old files while testing
                     ## you can always delete them later
    
    fuseColumns<-function(file,cols2fuse){
        test<-read.table(file,sep="\t",header=TRUE)
        test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)
        write.table(test,
                    file =paste0(prefix,file), # only works if preformed in pwd
                                            ## otherwise you may end up with
                                            ## Something like: 
                                            ## "fused-/home/username/file/1.tsv"                 
                    sep = "\t",
                    quote = TRUE, ## this will suround each output 
                                  ## value in quotes.
                                  ## this may not be desirable
                    row.names = TRUE, ## Do you really want the row names
                                      ## included?
                    col.names = TRUE)
        file ## return the file that has been edited (this will show up in stdout
    }
    
    lapply(files,fuseColumns,cols2fuse) ## Apply fuseColumns to all .tsv fusing
                                        ## columns with names that
                                        ## match those in cols2fuse
    

    Sample input

    amino_acid  v_gene  j_gene
    amino1  ENS0001001  ENS0002001
    amino2  ENS0003001  ENS0004001
    amino3  ENS0005001  ENS0006001
    amino4  ENS0007001  ENS0008001
    

    Is transfomed into

    "amino_acid"    "v_gene"    "j_gene"    "Fusion"
    "1" "amino1"    "ENS0001001"    "ENS0002001"    "amino1ENS0001001ENS0002001"
    "2" "amino2"    "ENS0003001"    "ENS0004001"    "amino2ENS0003001ENS0004001"
    "3" "amino3"    "ENS0005001"    "ENS0006001"    "amino3ENS0005001ENS0006001"
    "4" "amino4"    "ENS0007001"    "ENS0008001"    "amino4ENS0007001ENS0008001"
    

    To remove the quotation around each of the elements set quote to FALSE and to remove the numbers at the start of each row set row.names to FALSE as well.

    write.table(test,
                file =paste0(prefix,file),                  
                sep = "\t",
                quote = FALSE,
                row.names = FALSE,                       
                col.names = TRUE)
    

    The output now looks like

    amino_acid  v_gene  j_gene  Fusion
    amino1  ENS0001001  ENS0002001  amino1ENS0001001ENS0002001
    amino2  ENS0003001  ENS0004001  amino2ENS0003001ENS0004001
    amino3  ENS0005001  ENS0006001  amino3ENS0005001ENS0006001
    amino4  ENS0007001  ENS0008001  amino4ENS0007001ENS0008001
    

    I'm not sure if by redundant you mean you wish to remove the three columns and only display the fused column?

    You can use something like this to identify the redundant columns

    redundantCols<-unlist(sapply(colnames(test),`%in%`,cols2fuse))