
How can I draw a data variance distribution among multiple datasets in R?


I have three biomedical datasets (one binary matrix, one continuous matrix and one discrete matrix). I want to draw a distribution plot of a summary statistic (variance, median or mean) that includes all three in one figure, and then compute the skewness and the P-value from the D’Agostino test among the three datasets. Specifically, in each distribution curve, the x-axis indicates the variance (or mean, or median) of genes, while the y-axis indicates the frequency or density of genes across samples.

The figure below is similar to the result I want.

enter image description here

Here are the reproducible datasets.

-df1:

df1 = structure(c(-0.056, -0.056, -0.056, -0.056, -0.056,
                  -0.1388, -0.1388, -0.1388, -0.1388, -0.1388,
                  -0.0592, -0.0592, -0.0592, -0.0592, -0.0592,
                  -0.0646, -0.0646, -0.0646, -0.0646, -0.0646,
                  -0.1669, -0.1669, -0.1669, -0.1669, -0.1669),
                .Dim = c(5L, 5L),
                .Dimnames = list(c("TCGA-4H-AAAK-01", "TCGA-5L-AAT0-01", "TCGA-5T-A9QA-01",
                                   "TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01"),
                                 c("TBC1D21", "FGF4", "KRTAP9-4", "PSG11", "ADAM5")))

-df2:

df2 = structure(c(0L, 0L, 2L, 0L, 0L,
                  0L, 0L, 2L, 0L, 0L,
                  0L, 0L, 2L, 0L, 0L,
                  0L, 0L, 2L, 0L, 0L,
                  0L, 0L, 2L, 0L, 0L),
                .Dim = c(5L, 5L),
                .Dimnames = list(c("TCGA-4H-AAAK-01", "TCGA-5L-AAT0-01", "TCGA-5T-A9QA-01",
                                   "TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01"),
                                 c("GPR124", "ERLIN2", "LOC728024", "PROSC", "KCNU1")))

-df3:

df3 = structure(c(0L, 1L, 0L, 0L, 0L,
                  0L, 0L, 0L, 0L, 0L,
                  0L, 1L, 0L, 0L, 0L,
                  0L, 1L, 0L, 0L, 0L,
                  1L, 1L, 0L, 0L, 0L),
                .Dim = c(5L, 5L),
                .Dimnames = list(c("TCGA-4H-AAAK-01", "TCGA-5L-AAT0-01", "TCGA-5T-A9QA-01",
                                   "TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01"),
                                 c("PIK3CA", "TP53", "TTN", "MUC16", "CDH1")))

I have searched the web extensively but found nothing that does what I need. Any help would be appreciated. Thanks in advance.

I think the first step is to merge my three datasets into one:

MYdata = do.call("rbind", list(t(df1), t(df2),t(df3)))

Then I compute the variance of the three datasets:

MYdata = var(MYdata)

Finally, I have to plot them using ggplot2 (I think), but that is too complicated for a new R user like me.


Solution

  • From my understanding, you have three datasets and you would like to plot, in a single graph, the density of values in each dataset, with a vertical line representing the mean, the median or the variance. Am I right?

    A possible solution is to merge the datasets, but only AFTER reshaping them into a longer format (for example with the pivot_longer function from the tidyr package) and adding a column that names the dataset each row came from:

    With your example, it can be:

    library(tidyr)
    library(dplyr)
    DF1 <- as.data.frame(df1) %>% mutate(Patients = rownames(df1)) %>% 
      pivot_longer(-Patients, names_to = "Genes",values_to = "Values") %>%
      mutate(Dataset = "DF1")
    
    # A tibble: 25 x 4
       Patients        Genes     Values Dataset
       <chr>           <chr>      <dbl> <chr>  
     1 TCGA-4H-AAAK-01 TBC1D21  -0.056  DF1    
     2 TCGA-4H-AAAK-01 FGF4     -0.139  DF1    
     3 TCGA-4H-AAAK-01 KRTAP9-4 -0.0592 DF1    
     4 TCGA-4H-AAAK-01 PSG11    -0.0646 DF1    
     5 TCGA-4H-AAAK-01 ADAM5    -0.167  DF1    
     6 TCGA-5L-AAT0-01 TBC1D21  -0.056  DF1    
     7 TCGA-5L-AAT0-01 FGF4     -0.139  DF1    
     8 TCGA-5L-AAT0-01 KRTAP9-4 -0.0592 DF1    
     9 TCGA-5L-AAT0-01 PSG11    -0.0646 DF1    
    10 TCGA-5L-AAT0-01 ADAM5    -0.167  DF1    
    # … with 15 more rows
    

    Now do the same thing for df2 and df3, and bind all the rows together:

    library(tidyr)
    library(dplyr)
    DF2 <- as.data.frame(df2) %>% mutate(Patients = rownames(df2)) %>% 
      pivot_longer(-Patients, names_to = "Genes",values_to = "Values") %>%
      mutate(Dataset = "DF2")
    
    DF3 <- as.data.frame(df3) %>% mutate(Patients = rownames(df3)) %>% 
      pivot_longer(-Patients, names_to = "Genes",values_to = "Values") %>%
      mutate(Dataset = "DF3")
    
    DF <- bind_rows(DF1,DF2,DF3)
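
The same long table can also be built without tidyr or dplyr. This is a minimal base R sketch, not a replacement for the code above: `to_long` and `DF_base` are hypothetical names, and the three matrices are rebuilt compactly with the same values as the `dput` output in the question so that the block runs on its own.

```r
# Rebuild the question's matrices compactly (same values as the dput output)
patients <- c("TCGA-4H-AAAK-01", "TCGA-5L-AAT0-01", "TCGA-5T-A9QA-01",
              "TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01")
df1 <- matrix(rep(c(-0.056, -0.1388, -0.0592, -0.0646, -0.1669), each = 5),
              nrow = 5,
              dimnames = list(patients,
                              c("TBC1D21", "FGF4", "KRTAP9-4", "PSG11", "ADAM5")))
df2 <- matrix(rep(c(0L, 0L, 2L, 0L, 0L), times = 5),
              nrow = 5,
              dimnames = list(patients,
                              c("GPR124", "ERLIN2", "LOC728024", "PROSC", "KCNU1")))
df3 <- matrix(c(0L, 1L, 0L, 0L, 0L,  0L, 0L, 0L, 0L, 0L,  0L, 1L, 0L, 0L, 0L,
                0L, 1L, 0L, 0L, 0L,  1L, 1L, 0L, 0L, 0L),
              nrow = 5,
              dimnames = list(patients,
                              c("PIK3CA", "TP53", "TTN", "MUC16", "CDH1")))

# Hypothetical helper: turn a patients-by-genes matrix into a long data frame
# with one row per (patient, gene) pair, plus a label for the source dataset.
to_long <- function(m, dataset) {
  long <- as.data.frame(as.table(m), responseName = "Values",
                        stringsAsFactors = FALSE)
  names(long)[1:2] <- c("Patients", "Genes")
  long$Dataset <- dataset
  long
}

# Stack the three long tables; rows are ordered gene by gene here,
# whereas pivot_longer orders them patient by patient.
DF_base <- rbind(to_long(df1, "DF1"),
                 to_long(df2, "DF2"),
                 to_long(df3, "DF3"))
head(DF_base)
```

Wrapping the reshape in one function avoids repeating the same pipeline three times; the columns match those of `DF`, only the row order differs.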
    

    Next, we create a second data frame containing the mean, median and variance per dataset:

    library(dplyr)
    DF_mean <- DF %>% group_by(Dataset) %>% 
      summarise(Mean = mean(Values),
                Median = median(Values),
                Var = var(Values))
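
The summary above gives one value per dataset. The x-axis described in the question is a per-gene statistic (the variance, mean or median of each gene across samples), which can be computed from the original matrices instead. A minimal base R sketch follows; `gene_stats` is a hypothetical helper name, and the `toy` matrix is made-up illustration data, not part of the question.

```r
# Hypothetical helper: per-gene summaries for a samples-by-genes matrix
# (rows = patients/samples, columns = genes).
gene_stats <- function(m, dataset) {
  data.frame(
    Dataset = dataset,
    Genes   = colnames(m),
    Mean    = apply(m, 2, mean),    # mean of each gene across samples
    Median  = apply(m, 2, median),  # median of each gene across samples
    Var     = apply(m, 2, var),     # sample variance of each gene
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy matrix for illustration only: geneA varies, geneB is constant
toy <- matrix(c(1, 2, 3,
                2, 2, 2),
              nrow = 3,
              dimnames = list(c("s1", "s2", "s3"), c("geneA", "geneB")))

stats <- gene_stats(toy, "toy")
stats
```

For the question's data this would be `rbind(gene_stats(df1, "DF1"), gene_stats(df2, "DF2"), gene_stats(df3, "DF3"))`, and the `Var` column could then be used in place of `Values` as the x aesthetic of a density plot. Note that in the 5 x 5 sample of `df1` every gene has the same value in all patients, so its per-gene variances are all zero; a meaningful curve needs the full data.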
    

    Finally, we can use those two data frames to plot the density of each dataset and add a vertical line at the mean of each dataset:

    library(tidyr)
    library(dplyr)
    library(ggplot2)
    
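    ggplot(DF,aes(x = Values, fill = Dataset))+
      # one semi-transparent density curve per dataset
      geom_density(alpha = 0.6)+
      # dashed vertical line at each dataset's mean, taken from DF_mean
      # (with ggplot2 >= 3.4.0, use linewidth = 2 instead of size = 2)
      geom_vline(inherit.aes = FALSE, 
                 data = DF_mean, aes(xintercept = Mean, color = Dataset),
                 linetype = "dashed", size = 2,
                 show.legend = FALSE)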
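
The question also asks for the skewness and a D'Agostino test P-value, which the plot does not provide. Here is a hedged sketch of one way to get them: `sample_skewness` is a hypothetical helper implementing the usual moment-based estimator in base R, the two vectors are toy data, and the test itself is delegated to `agostino.test` from the `moments` package, which is assumed to be installed (the call is skipped otherwise).

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2),
# where m2 and m3 are the second and third central moments (divided by n).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy vectors for illustration only (not taken from the question's data)
symmetric    <- c(-2, -1, 0, 1, 2)
right_skewed <- c(1, 1, 1, 2, 2, 3, 4, 6, 9, 15)

sample_skewness(symmetric)     # 0 for a perfectly symmetric sample
sample_skewness(right_skewed)  # positive for a long right tail

# D'Agostino skewness test, run only if the 'moments' package is available
if (requireNamespace("moments", quietly = TRUE)) {
  res <- moments::agostino.test(right_skewed)
  print(res$p.value)
}
```

Per dataset, something like `tapply(DF$Values, DF$Dataset, sample_skewness)` should give one skewness value per group. `agostino.test` needs at least 8 observations, so it can be run on the 25 values of each dataset here but not on only 5 per-gene summaries.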
    

    enter image description here

    Does this answer your question?