r rna-seq

Is there a way to calculate the Z score for all values in a row in a data frame?


I have a data frame which contains expression levels of a gene in 1677 conditions. I am looking to create a new data frame which has the Z score for each condition. This is the code I have so far:

for (cell_no in 1:ncol(NANOG_data)) {
  z_score[cell_no] <- (NANOG_data[2, cell_no] - rowMeans(NANOG_data)) / rowSds(as.matrix(NANOG_data))}

And this is what the data frame looks like.

When I run this code, I get this error:

Error: object 'z_score' not found.

Is there a way to more easily populate a new data frame using a for loop, or is there a vectorized function I can run on my original data frame to calculate the Z score for each value?


Solution

  • As @GuedesBF commented, posting a screenshot of your data is bad practice and best avoided (see https://xkcd.com/2116/).

    I will try to help you with a dummy dataset:

    #let's first generate a matrix
    set.seed(999)
    my_dummy_data <- matrix(rnorm(length(letters)), nrow=1, dimnames=list(1,letters))
    
    >my_dummy_data 
               a        b        c         d          e          f         g
    1 -0.2817402 -1.31256 0.795184 0.2700705 -0.2773064 -0.5660237 -1.878658
              h          i         j        k         l         m         n
    1 -1.266791 -0.9677497 -1.121009 1.325464 0.1339774 0.9387494 0.1725381
              o         p          q         r         s         t         u
    1 0.9576504 -1.362686 0.06833513 0.1006576 0.9013448 -2.074357 -1.228563
              v          w         x         y         z
    1 0.6430443 -0.3597629 0.2940356 -1.125268 0.6422657
    

    As far as I understand, this has the same structure as your data: the column names are identifiers (e.g. "AAACCCTG...") and the numerical values are expression levels. (I am not a geneticist, so apologies if I get the terminology wrong.)

    Now, I assume that you want to generate a new vector in which the expression values are transformed into z-scores by subtracting the mean and dividing by the standard deviation. That can be done by:

    my_z_scores <- (my_dummy_data - mean(my_dummy_data)) / sd(my_dummy_data)
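
    The same idea extends to a matrix with several rows, where each row gets its own mean and standard deviation. The following is only a sketch under assumptions: `expr` and its gene/condition labels are made-up illustration data, and base R's `apply(expr, 1, sd)` stands in for `rowSds()` from the question so that no extra package is needed.

    ```r
    # Hypothetical expression matrix: 3 genes (rows) x 5 conditions (columns).
    set.seed(1)
    expr <- matrix(rnorm(15), nrow = 3,
                   dimnames = list(paste0("gene", 1:3), paste0("cond", 1:5)))

    # One mean and one standard deviation per row.
    row_means <- rowMeans(expr)
    row_sds   <- apply(expr, 1, sd)

    # R recycles a length-nrow vector down the columns of a matrix,
    # so each element is paired with the statistic of its own row.
    z_manual <- (expr - row_means) / row_sds

    # scale() standardizes columns, hence the two transposes for rows.
    z_scaled <- t(scale(t(expr)))
    ```

    Both approaches should agree up to floating-point error; `all.equal(z_manual, z_scaled, check.attributes = FALSE)` is one way to check.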
    

    Going beyond your actual question, before doing any further analysis, you might want to transform your data into a columnar form:

    my_better_dummy_data <- data.frame(gene = colnames(my_dummy_data), expression = as.vector(my_dummy_data))
    

    In columnar form, the z-scores could be calculated as

    my_better_dummy_data$z_score <- (my_better_dummy_data$expression - mean(my_better_dummy_data$expression)) / sd(my_better_dummy_data$expression)
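
    Coming back to the error in the question: `object 'z_score' not found` appears because the loop assigns into `z_score[cell_no]` before any object called `z_score` exists. If a loop is preferred, a minimal sketch looks like the one below. The vector `expr` and its values are hypothetical stand-ins for one row of the real data.

    ```r
    # Hypothetical stand-in for the question's data: one gene, five conditions.
    expr <- c(cond1 = 2.1, cond2 = 0.4, cond3 = 3.3, cond4 = 1.0, cond5 = 2.7)

    # Create the result vector first; assigning by index into a name
    # that does not exist yet is what raises "object 'z_score' not found".
    z_score <- numeric(length(expr))

    # Compute the mean and standard deviation once, outside the loop.
    m <- mean(expr)
    s <- sd(expr)
    for (i in seq_along(expr)) {
      z_score[i] <- (expr[i] - m) / s
    }

    # The vectorized form gives the same numbers without a loop.
    z_vectorized <- (expr - m) / s
    ```

    The loop and the vectorized expression produce the same values, so the vectorized form is usually the simpler choice.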