So, I am new to R.
I am trying to make something that reads a patient ID from a list of patients listx, extracts the corresponding column (column name = name of patient) in matrix 1 (genomicmatrix), and then runs a row-by-row statistical analysis between that column from genomicmatrix and another matrix (mxy), both of which have the same number of rows.
THEN, it writes the results into a CSV file.
THEN, it moves on to the next patient in listx and repeats the procedure.
Again, I'm new at this, so I hope I'm being clear. Here's what I have so far:
for(i in seq_along(mxy)){
for (j in seq_along(listx)){
indgene <- try(gex[,listx[j][listx[j] %in% names(gex)]])
}
zvalues[i] <- (indgene[i] - mean(mxy[i,])) / sd(mxy[i,])
geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL, stringsAsFactors = FALSE)
write.table(geneexptest$gex.sample, paste(names(listx)[j], ".csv", sep = ","),
row.names=FALSE, col.names=FALSE, sep=",", quote=F)
zvalues = NULL
indgene = NULL
geneexptest = NULL
}
SO I know it's kind of a mess. And it doesn't work. It just builds up a bunch of NAs in zvalues endlessly. I want it to build one indgene vector, use it to fill zvalues for THAT patient ONLY, make a data frame and write it as a CSV, then delete ALL that stuff and continue with the NEXT patient.
One more thing - is there any way to make the name of the CSV file change for every run (like name it after the patient ID currently being looked at?), so that the final output is x number of CSV files, each corresponding to a patient in listx.
THANKS SO MUCH!!
gex:
sample TCGA-F4-6703-01 TCGA-DM-A28E-01 TCGA-AY-6197-01 TCGA-A6-5657-01
[1,] 987 0.79790041 2.3517004 1.7580004 0.6067004
[2,] 7829 -1.13418473 -1.4130847 -2.3078847 0.2550153
[3,] 15097 -0.45561492 -0.4556149 -0.4556149 -0.4556149
[4,] 15056 0.03217751 -0.1146225 0.1363775 -0.3028225
[5,] 15058 -0.31903849 -1.2251385 -1.2339385 -0.8575385
[6,] 15072 -0.19546513 -0.4911651 -0.7853651 -1.2155651
listx <- c("TCGA-DM-A28E-01","TCGA-A6-5657-01")
mxy:
TCGA-AD-6963-01 TCGA-AA-3663-11 TCGA-AD-6901-01 TCGA-A6-A567-01
[1,] 1.0513004 1.2421004 1.5119004 1.6991004
[2,] -0.7592847 3.2265153 -0.8288847 -0.4752847
[3,] -0.4556149 -0.4556149 -0.4556149 -0.4556149
[4,] -0.3492225 0.1348775 -0.1155225 -0.3586225
[5,] -1.7248385 0.0427615 -1.5324385 -0.3399385
[6,] -0.8287651 -0.3504651 -0.5890651 -0.1925651
Okay. Given the information in your example, I have put this together.
First, I just generated random numbers in place of yours (because I got lazy after copying sample).
Second, because you are going to be saving this to a .csv file, I changed the structure of the colnames. You have - as a separator in your patient identifiers; however, data.frame() replaces these with . in the column names (its default is check.names = TRUE).
If you just used that dotted identifier when creating the name for the .csv file, you would end up with TCGA.DM.A28E.01.csv. But how is your file system going to know whether you want to save a .csv file or another format called .A28E.01.csv?
This is why you have the gsub lines after each data.frame.
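As a quick check of that renaming, here is a minimal sketch using two of the IDs from the question. The object names ids, df_default, df_raw and underscored are made up for illustration; make.names() is the base R function that data.frame() applies to column names when check.names = TRUE.

```r
# Hyphenated patient IDs, as they appear in the question
ids <- c("TCGA-DM-A28E-01", "TCGA-A6-5657-01")

# make.names() turns characters that are invalid in R names into dots
make.names(ids)
# "TCGA.DM.A28E.01" "TCGA.A6.5657.01"

# data.frame() does the same by default (check.names = TRUE)
df_default <- data.frame("TCGA-DM-A28E-01" = 1:3)
names(df_default)
# "TCGA.DM.A28E.01"

# check.names = FALSE keeps the hyphens; such columns then need [[ ]] or backticks
df_raw <- data.frame("TCGA-DM-A28E-01" = 1:3, check.names = FALSE)
names(df_raw)
# "TCGA-DM-A28E-01"

# The approach used in this answer: swap the dots for underscores
underscored <- gsub("[.]", "_", names(df_default))
underscored
# "TCGA_DM_A28E_01"
```

Whichever form you pick, the IDs in listx have to match the column names exactly, otherwise the column lookup fails. The full example follows.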
# Random stand-in data; only the sample IDs are copied from the question
gex <- data.frame("sample" = c(987, 7829, 15056, 15058, 15072),
                  "TCGA-F4-6703-01" = runif(5, -1, 1),
                  "TCGA-DM-A28E-01" = runif(5, -1, 1),
                  "TCGA-AY-6197-01" = runif(5, -1, 1),
                  "TCGA-A6-5657-01" = runif(5, -1, 1))
# data.frame() turned "-" into "."; replace the dots with underscores
colnames(gex) <- gsub("[.]", "_", colnames(gex))
# Patient IDs written to match the underscored column names
listx <- c("TCGA_DM_A28E_01", "TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
                  "TCGA-AA-3663-11" = runif(5, -1, 1),
                  "TCGA-AD-6901-01" = runif(5, -1, 1),
                  "TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_", colnames(mxy))
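# Row-wise mean and sd of mxy, computed once (one value per gene)
row_mean <- rowMeans(mxy)
row_sd <- apply(mxy, 1, sd)
# One pass per patient: z-score that patient's column, then write <patient ID>.csv
invisible(lapply(listx, function(id) {
  zvalues <- (gex[[id]] - row_mean) / row_sd
  geneexptest <- data.frame(sample = gex$sample, zvalues = zvalues,
                            stringsAsFactors = FALSE)
  # write.csv() sets sep and col.names itself, so they are not passed here
  write.csv(geneexptest, file = paste0(id, ".csv"),
            row.names = FALSE, quote = FALSE)
}))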
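For comparison, here is a sketch of the same per-patient idea written as a plain for loop, which may be closer to what you had in mind. It rests on a few assumptions that are mine rather than from your post: the data are random stand-ins, check.names = FALSE is used so the hyphenated IDs survive, the files go to tempdir(), and "TCGA-XX-0000-01" is a made-up ID included only to show how a patient missing from gex can be skipped.

```r
set.seed(42)  # reproducible stand-in data (hypothetical values)

# check.names = FALSE keeps the hyphens in the patient IDs
gex <- data.frame(sample = c(987, 7829, 15056, 15058, 15072),
                  "TCGA-DM-A28E-01" = runif(5, -1, 1),
                  "TCGA-A6-5657-01" = runif(5, -1, 1),
                  check.names = FALSE)
mxy <- matrix(runif(20, -1, 1), nrow = 5)  # 5 genes x 4 reference samples

# The last ID is hypothetical and deliberately absent from gex
listx <- c("TCGA-DM-A28E-01", "TCGA-A6-5657-01", "TCGA-XX-0000-01")

out_dir <- tempdir()          # assumption: a disposable output directory
row_mean <- rowMeans(mxy)     # per-gene mean across the reference samples
row_sd <- apply(mxy, 1, sd)   # per-gene standard deviation

written <- character(0)
for (id in listx) {
  if (!id %in% names(gex)) {
    message("Skipping ", id, ": no such column in gex")
    next
  }
  # Row-by-row z-score of this patient against mxy
  zvalues <- (gex[[id]] - row_mean) / row_sd
  out <- data.frame(sample = gex$sample, zvalues = zvalues)
  # The file is named after the current patient ID
  path <- file.path(out_dir, paste0(id, ".csv"))
  write.csv(out, path, row.names = FALSE)
  written <- c(written, path)
}

# Read one file back to confirm its shape
check <- read.csv(written[1])
```

Because zvalues and out are created fresh on every iteration, there is nothing to reset to NULL at the end of the loop body.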