I would like to combine a few steps into an R script to do the following:
The following steps work, but they are very clumsy I'm afraid (I'm really bad at coding) and they are not batched and have to be put in one after the other.
test <- read.table(
"1.tsv",
sep="\t", header=TRUE)
test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)
write.table(test, file = "1.tsv", append = FALSE, quote = TRUE, sep = "\t",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
As you can see, the files have to be put in by hand one at a time and the data frame "test" also seems redundant (?).
It would be great if someone could put these together in a script, which simply uses the working directory of R and goes through the files one by one, adds a new "Fusion" column, writes the new .tsv file and moves on.
Thanks already for your help!
Here is what I would do to loop over your code for each file in the pwd using your approach. Make sure that you run this script in the directory with your target .tsv files.
#!/usr/bin/Rscript
print(getwd()) ## print the pwd to the standard output to ensure that you are in the
## right directory
files<-list.files(".",pattern="*.tsv") ## List all files in the pwd that end in .tsv
cols2fuse<-c("amino_acid","v_gene","j_gene") ## Parametrized the columns to fuse
prefix<-"fused-" ## Include this so that you don't overwrite your old files while testing
## you can always delete them later
fuseColumns<-function(file,cols2fuse){
test<-read.table(file,sep="\t",header=TRUE)
test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)
write.table(test,
file =paste0(prefix,file), # only works if preformed in pwd
## otherwise you may end up with
## Something like:
## "fused-/home/username/file/1.tsv"
sep = "\t",
quote = TRUE, ## this will suround each output
## value in quotes.
## this may not be desirable
row.names = TRUE, ## Do you really want the row names
## included?
col.names = TRUE)
file ## return the file that has been edited (this will show up in stdout
}
lapply(files,fuseColumns,cols2fuse) ## Apply fuseColumns to all .tsv fusing
## columns with names that
## match those in cols2fuse
Sample input
amino_acid v_gene j_gene
amino1 ENS0001001 ENS0002001
amino2 ENS0003001 ENS0004001
amino3 ENS0005001 ENS0006001
amino4 ENS0007001 ENS0008001
Is transfomed into
"amino_acid" "v_gene" "j_gene" "Fusion"
"1" "amino1" "ENS0001001" "ENS0002001" "amino1ENS0001001ENS0002001"
"2" "amino2" "ENS0003001" "ENS0004001" "amino2ENS0003001ENS0004001"
"3" "amino3" "ENS0005001" "ENS0006001" "amino3ENS0005001ENS0006001"
"4" "amino4" "ENS0007001" "ENS0008001" "amino4ENS0007001ENS0008001"
To remove the quotation around each of the elements set quote
to FALSE
and to remove the numbers at the start of each row set row.names
to FALSE
as well.
write.table(test,
file =paste0(prefix,file),
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = TRUE)
The output now looks like
amino_acid v_gene j_gene Fusion
amino1 ENS0001001 ENS0002001 amino1ENS0001001ENS0002001
amino2 ENS0003001 ENS0004001 amino2ENS0003001ENS0004001
amino3 ENS0005001 ENS0006001 amino3ENS0005001ENS0006001
amino4 ENS0007001 ENS0008001 amino4ENS0007001ENS0008001
I'm not sure if by redundant you mean you wish to remove the three columns and only display the fused column?
You can use something like this to identify the redundant columns
redundantCols<-unlist(sapply(colnames(test),`%in%`,cols2fuse))