r dataframe matrix p-value density-plot

A general solution to analyze and plot two data frames with varying lengths?


Could you please help me?

I'm writing code in R to automate a null model analysis of multiple networks. First, the code reads multiple TXT matrices into R. Second, it calculates a topological metric for each network. Third, it randomizes each network N times using a null model. Fourth, it calculates the same topological metric for all randomized versions of the original matrices.

In the fifth and final step, the idea is to compare the observed scores against the distributions of randomized scores in two ways. The first is a simple count of how many randomized scores are above or below the observed score, in order to estimate the P-values. The second is a density plot of the randomized scores with a vertical line showing the observed score.

Here are examples of the data frames that need to be analyzed:

networks <- paste0("network", 1:3)
randomizations <- 1:10

observed.ex <- data.frame(network = networks,
                          observed = runif(3, min = 0, max = 1))

randomized.ex <- data.frame(network = sort(rep(networks, 10)),
                            randomization = rep(randomizations, 3),
                            randomized = rnorm(length(networks)*
                                                   length(randomizations),
                                               mean = 0.5, sd = 0.1))

In the first step of the final analysis, the code estimates the P-values by doing simple counts. As you see, I need to make copies of the calculation call for each network:

randomized.network1 <- subset(randomized.ex, network == "network1")
sum(randomized.network1$randomized >= observed.ex$observed[1]) /
    length(randomized.network1$randomized)
sum(randomized.network1$randomized <= observed.ex$observed[1]) /
    length(randomized.network1$randomized)

randomized.network2 <- subset(randomized.ex, network == "network2")
sum(randomized.network2$randomized >= observed.ex$observed[2]) /
    length(randomized.network2$randomized)
sum(randomized.network2$randomized <= observed.ex$observed[2]) /
    length(randomized.network2$randomized)

randomized.network3 <- subset(randomized.ex, network == "network3")
sum(randomized.network3$randomized >= observed.ex$observed[3]) /
    length(randomized.network3$randomized)
sum(randomized.network3$randomized <= observed.ex$observed[3]) /
    length(randomized.network3$randomized)

In the second step of the final analysis, the code makes density plots. As you see, I need to make copies of the vertical line call for each network:

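library(ggplot2) # For ggplot, geom_density, facet_grid, and geom_vline
library(dplyr)   # For filter
ggplot(randomized.ex, aes(randomized)) +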
    geom_density() +
    facet_grid(network~.) +
    geom_vline(data=filter(randomized.ex, network == "network1"),
               aes(xintercept = observed.ex$observed[1]), colour = "red") + 
    geom_vline(data=filter(randomized.ex, network == "network2"),
               aes(xintercept = observed.ex$observed[2]), colour = "red") + 
    geom_vline(data=filter(randomized.ex, network == "network3"),
               aes(xintercept = observed.ex$observed[3]), colour = "red") 

Is there a way to make this final analysis more general, so it always does the same calculations and plots, no matter how many networks are read in the beginning?

Thank you very much!


Solution

  • It looks like this can be neatly wrapped in an lapply() call that iterates over each network. How does the code below work for you? You could also pass in file names rather than the network numbers (currently 1:3) and have the first line read in your TXT matrices.

    library(dplyr)   #For %>%, group_by, and summarize
    library(ggplot2) #For ggplot, geom_density, facet_grid, and geom_vline
    output <- lapply(1:3, function(network_num){
      network <- paste0("network", network_num)
      n_randomizations <- 10
      observed.ex <- runif(1)
      randomized.ex <- rnorm(n_randomizations, mean = 0.5, sd = 0.1)
    
      return(data.frame(network=network, observed=observed.ex, randomized=randomized.ex))
    }) %>% do.call(what = rbind)
    
    # Per network: proportion of randomized scores at or below the observed score
    output %>%
      group_by(network) %>%
      summarize(p_value=mean(observed>=randomized))
    
    ggplot(output) +
      geom_density(aes(randomized)) +
      facet_grid(network~.) +
      geom_vline(aes(xintercept = observed), col="red")
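
If the observed and randomized scores stay in two separate data frames, as in the question, the same idea could be sketched roughly as follows. This is only a sketch under assumptions: the data are simulated as in the question, the seed is arbitrary, and the names `n_rand`, `both`, `p_values`, `p_upper`, and `p_lower` are made up for illustration. It merges the two data frames by network, computes both tail proportions per network in base R, and hands `observed.ex` to geom_vline() so that each facet gets its own line.

```r
# Simulated data in the question's layout (assumption: real scores would
# come from the network files instead).
set.seed(42)  # arbitrary seed, only to make the example reproducible
networks <- paste0("network", 1:3)
n_rand <- 10  # hypothetical number of randomizations per network

observed.ex <- data.frame(network = networks,
                          observed = runif(length(networks)))
randomized.ex <- data.frame(network = rep(networks, each = n_rand),
                            randomization = rep(seq_len(n_rand),
                                                times = length(networks)),
                            randomized = rnorm(length(networks) * n_rand,
                                               mean = 0.5, sd = 0.1))

# Attach each network's observed score to all of its randomized scores.
both <- merge(randomized.ex, observed.ex, by = "network")

# One row per network: proportion of randomized scores at or above
# (p_upper) and at or below (p_lower) the observed score.
p_values <- do.call(rbind, lapply(split(both, both$network), function(d) {
  data.frame(network = d$network[1],
             p_upper = mean(d$randomized >= d$observed),
             p_lower = mean(d$randomized <= d$observed))
}))
rownames(p_values) <- NULL
print(p_values)

# Density plot with one vertical line per facet; built only if ggplot2
# is installed. Because observed.ex has a `network` column, facet_grid()
# places each line in the matching panel.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(randomized.ex, aes(randomized)) +
    geom_density() +
    facet_grid(network ~ .) +
    geom_vline(data = observed.ex, aes(xintercept = observed),
               colour = "red")
  # print(p) draws the plot
}
```

Nothing here depends on the number of networks: adding rows to `observed.ex` and `randomized.ex` adds rows to `p_values` and facets to the plot without further changes.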
    
