r matrix-multiplication

Dimension problem: multiplying part of the ith matrix in a list of matrices by the ith entry of a vector (used as a scalar)


I have a group of .csv files that contain matrices of identical size where the first row and first column are axis labels (they're fluorescence Excitation Emission Matrices, if anyone's familiar) that cannot be changed. The files are all named with the same pattern (e.g. CaL_026-4x-1p0-20Jul23_EEM.csv, CaL_027-10x-1p0-20Jul23_EEM.csv) and the name contains the dilution factor that I need to multiply cells B2:AZ293 by.

I was gifted a partially broken staRdom script that is supposed to read in .dat files, process them by removing noise and scattering, and then output a corrected .csv file to use for further data processing. One of the broken parts is the part that accounts for the dilution factor, meaning that my output file is sometimes full of values 20 times smaller than they should be. Ideally, R would read the file name (which always follows the pattern above), pull out the number between the dash and the x (I have fixed this part, but it's clunky), and then use that number to multiply cells B2:AZ293 of the corresponding matrix in the list (e.g. cells B2:AZ293 from CaL_026 in the example above should be multiplied by 4, but cells B2:AZ293 from CaL_027 should be multiplied by 10).

I tried several things for the number extraction, and settled on using one extraction for each order of magnitude (easy to add lines to include higher orders of magnitude if needed) and a line to deal with the background scan (which is technically a 1x dilution so that's what I did) and then converted from string to numeric:

#Read in EEM data! move folder name to working EEM folder
folder <- "path/subfolder" #accesses the EEM folder where data for specific instrument run is stored
eem_list <- eem_read("path/subfolder", recursive = FALSE, import_function = "aqualog") #reads EEMs in
#account for dilution factor corrections here!
dilution <- list.files("path/subfolder")
dilution<-str_replace(dilution, pattern = ".*-(.)x.*", replacement = "\\1")
dilution<-str_replace(dilution, pattern = ".*-(..)x.*", replacement = "\\1")
dilution<-str_replace(dilution, pattern = "MQblank.*", replacement = "1") 
dilution <-as.numeric(dilution)

eem_overview_plot(eem_list, spp=9, contour = TRUE) #plots EEM data
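
For reference, the two order-of-magnitude replacements can probably be collapsed into one pattern that captures any run of digits between the dash and the x. The following is only a sketch in base R rather than stringr; the file names (including the blank's name) and the helper extract_dilution are made up for illustration, and treating a non-matching name as a 1x dilution is an assumption carried over from the blank handling above.

```r
# Hypothetical file names following the pattern described above
files <- c("CaL_026-4x-1p0-20Jul23_EEM.csv",
           "CaL_027-10x-1p0-20Jul23_EEM.csv",
           "MQblank-20Jul23_EEM.csv")

# Hypothetical helper: capture one or more digits between a dash and an "x"
extract_dilution <- function(f) {
  hit <- grepl("-[0-9]+x", f)
  out <- rep(1, length(f))  # assumption: no match means a 1x dilution (e.g. the blank)
  out[hit] <- as.numeric(sub(".*?-([0-9]+)x.*", "\\1", f[hit], perl = TRUE))
  out
}

dilution <- extract_dilution(files)
dilution
```

Because the pattern accepts any number of digits, no extra line should be needed for higher orders of magnitude.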

Now, in theory, eem_list should be multiply-able by the dilution vector, BUT I don't know how to do that or how to constrain it to a subset of the cells. I need the first entry in the vector to multiply just cells B2:AZ293 (as they would appear if the file were open in Excel) of the first matrix, the second entry to multiply the same cells of the second matrix, and so on. What I mean: a mock before-and-after matrix of multiplying a portion of the matrix by the dilution factor. A1:A293 and A1:AZ1 are unchanged, but B2:AZ293 have been multiplied by 4.

[Image: mock before-and-after matrix]

I tried to do this:

dilution <-as.numeric(dilution)
#multiply file by dilution factor
eemlist <- for(i in 1:length(eem_list)){
  for(j in 1:length(dilution)){
    eem_list <-i[2:293,2:51]*j
  }
}

which spits out error "Error in i[2:293, 2:51] : incorrect number of dimensions" which I guess means I can't do just part of it? Or maybe I'm misunderstanding how this should work. Anyone have any ideas?

EDIT1: I found the solution to the dilution factor extraction problem as I was writing the question, so I apologize for the confusing code and thank you for your patience with me. I have attempted Phil's solution (thank you for the example! I had a much harder time finding similar problems with examples for this part) but I am still running into an incorrect number of dimensions error. Here is where I'm currently at:

folder <- "path/subfolder" #accesses the folder where example data is stored
eem_list <- eem_read("path/subfolder", recursive = FALSE, import_function = "aqualog") #reads EEMs in
#extract dilution factor from file name
dilution <- list.files("path/subfolder")
dilution<-str_replace(dilution, pattern = ".*-(..)x.*", replacement = "\\1")
dilution<-str_replace(dilution, pattern = ".*-(.)x.*", replacement = "\\1")
dilution<-str_replace(dilution, pattern = "MQblank.*", replacement = "1")
dilution <-as.numeric(dilution)
#multiply EEM by dilution factor
for (i in seq_along(eem_list)) {
  eem_list[[i]][2:293, 2:52] <- eem_list[[i]][2:293, 2:52] * dilution[i]
}

eem_overview_plot(eem_list, spp=9, contour = TRUE) 

New error reads "Error in eem_list[[i]][2:293, 2:52] : incorrect number of dimensions". I recounted the original file's dimensions, and it is actually 293 rows by 52 columns, so that's not the problem.

EDIT2: Chris's sanity check gives NULL dimension, which I think does explain why the matrix multiplication isn't working. I did more digging and realized that the input files are .dat files (oops) that are tab delimited, but every "column" is part of the same cell and there are two metadata rows between the first axis label row and the first row of the actual data (which is row 4). I tried two changes that didn't work. First I tried matching the actual range to use Phil's suggestion, but it still gave NULL dimension. Then I changed one of the .dat files using the Text-to-Columns transformation button in Excel to see if that would give it any dimension using Chris's suggestion and it didn't.

I think this means I have to relocate this dilution correction to after the export .csv files are generated. The only problem is that the files are exported after some peak picking files are generated. I don't know if relocating the export function will break the peak picking functions, but I am going to try it. I think my best bet is to export the uncorrected .csv files, read them back in, do the dilution correction as previously planned, then re-export them with the second export function to overwrite the first export files.

New code:

#RELOCATED EEM EXPORT FUNCTION. 
setwd("C:/Users/peter/Downloads/JohnstonLab/Projects/CampusLakes/Duetta/ProcessedCaL/026-034TEST") #sets the folder you plan to export into
eem_export=function(eem){
  #extract data in the right format
  df=eem$x[,ncol(eem$x):1]
  colnames(df)=as.character(eem$ex)
  rownames(df)=as.character(eem$em)
  
  write.csv(df, file = paste(eem$sample,".csv",sep=""), quote = FALSE)
}

lapply(1:length(eem_list), function(i) eem_export(eem_list[[i]]) )

#Read in processed EEM data
folder <- "exportpath/subfolder"#accesses the folder where export data is stored
eem_list <- eem_read("exportpath/subfolder", recursive = TRUE, import_function = "aqualog") #reads EEMs in
eem_overview_plot(eem_list, spp=9, contour = TRUE)
lapply(eem_list, dim)

#extract dilution factor from file name
dilution <- list.files("exportpath/subfolder")
dilution<-str_replace(dilution, pattern = ".*-(..)x.*", replacement = "\\1")
dilution<-str_replace(dilution, pattern = ".*-(.)x.*", replacement = "\\1")
#dilution<-str_replace(dilution, pattern = "MQblank.*", replacement = "1") #relocation means blank doesn't need to be accounted for
dilution <-as.numeric(dilution)
#multiply EEM by dilution factor
for (i in seq_along(eem_list)) {
  eem_list[[i]][2:293, 2:52] <- eem_list[[i]][2:293, 2:52] * dilution[i]
}

#EXPORT AGAIN!!!! This time DF is accounted for
setwd("exportfolder/subfolder") #sets the folder you plan to export into
eem_export=function(eem){
  
  #extract data in the right format
  df=eem$x[,ncol(eem$x):1]
  colnames(df)=as.character(eem$ex)
  rownames(df)=as.character(eem$em)
  
  write.csv(df, file = paste(eem$sample,".csv",sep=""), quote = FALSE)
}

lapply(1:length(eem_list), function(i) eem_export(eem_list[[i]]) )

The new code works up until the matrix algebra, so I haven't gotten to test whether or not the peak picking functions are broken yet. That said, running lapply(eem_list, dim) once again gives NULL dimension, so I'm not surprised the matrix algebra still isn't working. I've cracked open the .csv export files and they look like they should, so I am once again stuck. The error still reads "Error in eem_list[[i]][2:293, 2:52] : incorrect number of dimensions". Is it possibly an issue that the first cell (A1) is blank?

EDIT3: Running str(eem_list) prints:

str(eem_list)
List of 9
 $ :List of 6
  ..$ file    : chr "C:/Users/peter/Downloads/JohnstonLab/Projects/CampusLakes/Duetta/ProcessedCaL/026-034TEST/CaL_026-4x-1p0-20Jul23_EEM.csv"
  ..$ sample  : chr "CaL_026-4x-1p0-20Jul23_EEM"
  ..$ x       : num [1:292, 1:51] 0.0878 0.083 0.0781 0.0733 0.0686 ...
  ..$ ex      : num [1:51] 250 255 260 265 270 275 280 285 290 295 ...
  ..$ em      : num [1:292] 248 250 252 254 256 ...
  ..$ location: chr "C:/Users/peter/Downloads/JohnstonLab/Projects/CampusLakes/Duetta/ProcessedCaL/026-034TEST"
  ..- attr(*, "class")= chr "eem"
  ..- attr(*, "is_blank_corrected")= logi FALSE
  ..- attr(*, "is_scatter_corrected")= logi FALSE
  ..- attr(*, "is_ife_corrected")= logi FALSE
  ..- attr(*, "is_raman_normalized")= logi FALSE
 $ :List of 6
  ..$ file    : chr "C:/Users/peter/Downloads/JohnstonLab/Projects/CampusLakes/Duetta/ProcessedCaL/026-034TEST/CaL_027-10x-1p0-20Jul23_EEM.csv"
  ..$ sample  : chr "CaL_027-10x-1p0-20Jul23_EEM"
  ..$ x       : num [1:292, 1:51] 0.155 0.142 0.13 0.118 0.106 ...
  ..$ ex      : num [1:51] 250 255 260 265 270 275 280 285 290 295 ...
  ..$ em      : num [1:292] 248 250 252 254 256 ...
  ..$ location: chr "C:/Users/peter/Downloads/JohnstonLab/Projects/CampusLakes/Duetta/ProcessedCaL/026-034TEST"
  ..- attr(*, "class")= chr "eem"
  ..- attr(*, "is_blank_corrected")= logi FALSE
  ..- attr(*, "is_scatter_corrected")= logi FALSE
  ..- attr(*, "is_ife_corrected")= logi FALSE
  ..- attr(*, "is_raman_normalized")= logi FALSE

It looks like when I read them back in, it is breaking them into the component pieces that the export function used to build them (which seems to be a built-in feature of the staRdom eem_read function). The variable I'm trying to correct with the dilution factor is x, and the variables I am trying to leave alone are ex and em. When I open the files in Excel, they open with cell A1 blank, A2:A293 filled with the values of em, B1:AZ1 filled with the values of ex, and B2:AZ293 filled with x. I'm assuming this means I need to figure out how to have R multiply each x in eem_list instead of eem_list?
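
A small check along these lines would show why dim() kept returning NULL. This is a sketch using a mock object that only imitates the structure printed by str() above; the values are invented, and a real object would come from eem_read().

```r
# Mock object imitating the structure printed by str() above
# (values are made up; a real one comes from eem_read())
eem <- structure(
  list(
    sample = "mock_sample",
    x  = matrix(1, nrow = 292, ncol = 51),
    ex = seq(250, by = 5, length.out = 51),
    em = seq(248, by = 2, length.out = 292)
  ),
  class = "eem"
)

dim(eem)    # NULL: the object is a list, so [rows, cols] indexing fails
dim(eem$x)  # 292 51: the fluorescence matrix itself
```

The axis labels live in ex and em rather than in the first row and column of the matrix, which is why x is 292 by 51 instead of 293 by 52.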

EDIT4: I tried swapping out the matrix dimensions for "x"

for (i in seq_along(eem_list)) {
  eem_list[[i]][['x']] <- eem_list[[i]][['x']] * dilution[i]
}

and now it's working!! And the peak picking functions aren't broken either. Thanks, Phil!
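
One possible refinement, offered only as a sketch: the loop above relies on list.files() and eem_read() returning the samples in the same order. Since the str() output shows that each element stores its own sample name, the factor could instead be read from that field. The mock_eem helper and the shortened sample names below are hypothetical, and the code assumes every sample name contains a dash, digits, and an x.

```r
# Hypothetical helper that builds a mock eem-like list;
# real objects come from eem_read()
mock_eem <- function(name) {
  list(sample = name,
       x  = matrix(1, nrow = 292, ncol = 51),
       ex = seq(250, by = 5, length.out = 51),
       em = seq(248, by = 2, length.out = 292))
}
eem_list <- list(mock_eem("CaL_026-4x-1p0"), mock_eem("CaL_027-10x-1p0"))

for (i in seq_along(eem_list)) {
  # Read the factor from the element's own sample name
  # (assumption: the name always contains "-<digits>x")
  d <- as.numeric(sub(".*?-([0-9]+)x.*", "\\1",
                      eem_list[[i]]$sample, perl = TRUE))
  # Scale only the x matrix; ex and em are left untouched
  eem_list[[i]]$x <- eem_list[[i]]$x * d
}

range(eem_list[[1]]$x)  # 4 4
range(eem_list[[2]]$x)  # 10 10
```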


Solution

  • This is a partial answer, as I don't believe it will solve your problem. The code as you have it is problematic in several points.

    The part that creates the dilution object references an x object, but x is not defined anywhere, so it's not possible to tell whether it would work (I assume it's because the code that defines it is not shared in your question). I am assuming that it does, and that the dilution object is the same length as the eemlist object.

    The for loop doesn't make sense: you don't assign a for loop to an object. Rather, you make assignments to the ith element of the object inside the loop, doing something like

    for (i in 1:length(myobj)) {
      myobj[i] <- some_function(i)
    }
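
    To illustrate the point with a toy vector (a sketch, not your data): a for loop in R evaluates to NULL, so assigning the loop itself captures nothing, while assigning inside the loop updates the object.

    ```r
    myobj <- c(1, 2, 3)

    # Assigning the loop itself captures nothing: `for` returns NULL
    result <- for (i in seq_along(myobj)) {
      myobj[i] * 2
    }
    is.null(result)  # TRUE

    # Assigning inside the loop updates the object element by element
    for (i in seq_along(myobj)) {
      myobj[i] <- myobj[i] * 2
    }
    myobj  # 2 4 6
    ```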
    

    Moreover, you are getting this error because i represents a numeric value that iterates from 1 to the length of the object. So i[2:293,2:51] makes no sense: a single number has no rows or columns to subset. I think you meant eemlist[[i]][2:293,2:51], if eemlist is meant to be a list of matrices.
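
    A tiny reproduction of that error, using a small made-up matrix rather than your data:

    ```r
    i <- 1

    # A bare number has no dimensions, so two-index subsetting fails
    msg <- tryCatch(i[2:3, 2:3], error = function(e) conditionMessage(e))
    msg

    # Indexing the i-th matrix of a list works as expected
    mx_list <- list(matrix(1, nrow = 4, ncol = 4))
    mx_list[[i]][2:3, 2:3]
    ```

    The first call fails with the same "incorrect number of dimensions" message, while the second returns a 2 x 2 block of the matrix.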

    The following bit of code exemplifies what I think you're trying to achieve. The first bit is just to make the example reproducible, and is not meant to be used in your actual situation (the matrix will only consist of 1s to make the output easy to see):

    mymx_list <- list()
    
    for (i in 1:5) {
      mymx_list[[i]] <- matrix(1, nrow = 392, ncol = 51)
    }
    
    dilution <- 5:9
    

    The following will multiply each matrix in mymx_list (everything except its first row and first column) by the corresponding element of the dilution vector.

    for (i in seq_along(mymx_list)) {
      mymx_list[[i]][2:392, 2:51] <- mymx_list[[i]][2:392, 2:51] * dilution[i]
    }
    
    # To see the result
    mymx_list
    

    I'm using seq_along(x) here because it is safer than 1:length(x).
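
    The difference only shows up when the object is empty, as in this short sketch:

    ```r
    x <- list()   # an empty list

    1:length(x)   # 1 0: a loop over this would run twice
    seq_along(x)  # integer(0): a loop over this is skipped

    runs <- 0
    for (i in seq_along(x)) runs <- runs + 1
    runs          # 0
    ```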