Tags: r, bar-chart, p-value, t-test

Multiple barplot along with t-test


I want a barplot based on the number of occurrences of each string in a particular column of a dataset in R.

At the same time, I want to run a t-test and mark the significant p-values with stars on top of the bars. Non-significant results can be shown as ns.

My attempt has been:

barplot(prop.table(table(ttcluster_dataset$Phenotype)),col=clustercolor,border="black",xlab="Phenotypes",ylab="Percentage of Samples expressed",main="Sample wise Phenotype distribution",cex.names = 0.8)

The dataset column is:

ttcluster_dataset$Phenotype<- 
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Proneural (Cluster 1)", "Proneural (Cluster 2)", "Neural (Cluster 1)", "Neural (Cluster 2)", 
"Classical (Cluster 1)", "Classical (Cluster 2)", "Mesenchymal (Cluster 1)", 
"Mesenchymal (Cluster 2)"), class = "factor")

All suggestions are appreciated.


Solution

  • A t-test is probably not what you want, since you are comparing counts and proportions between the two clusters rather than means. Your data are not set up for either analysis, because phenotype and cluster are combined in a single factor, so first we need to split them into two variables:

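    # Split each label, e.g. "Proneural (Cluster 1)", on the spaces
    Pheno.splt <- strsplit(as.character(ttcluster_dataset$Phenotype), " ")
    # Keep the phenotype name (part 1) and the cluster number (part 3)
    Pheno.mat <- do.call(rbind, Pheno.splt)[, c(1, 3)]
    # Drop the trailing ")" from the cluster number
    ttclust <- data.frame(Phenotype=Pheno.mat[, 1], Cluster=gsub(")", "", Pheno.mat[, 2], fixed=TRUE))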
    str(ttclust)
    # 'data.frame': 171 obs. of  2 variables:
    #  $ Phenotype: chr  "Proneural" "Proneural" "Proneural" "Proneural" ...
    #  $ Cluster  : chr  "1" "1" "1" "1" ...
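
    Since there are several ways to do the split, here is a minimal sketch of one alternative that uses regular expressions instead of strsplit. It assumes every label has the form "Name (Cluster N)"; the vector pheno and the name ttclust_alt are illustrative only, not part of the original data.

    ```r
    # Hypothetical sample labels in the same format as the Phenotype column
    pheno <- c("Proneural (Cluster 1)", "Neural (Cluster 1)", "Classical (Cluster 2)")

    ttclust_alt <- data.frame(
      # Remove the trailing " (Cluster N)" to leave the phenotype name
      Phenotype = sub(" \\(Cluster [0-9]+\\)$", "", pheno),
      # Capture only the digits inside "(Cluster N)"
      Cluster = sub("^.*\\(Cluster ([0-9]+)\\)$", "\\1", pheno),
      stringsAsFactors = FALSE
    )
    print(ttclust_alt)
    ```

    On the real data, pheno would be as.character(ttcluster_dataset$Phenotype).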
    

    There are multiple ways to do this; here we simply split each Phenotype label into three parts on the spaces and keep the first and third. Now ttclust is a data frame with Phenotype and Cluster as separate columns. Next, a summary table and bar plot:

    tbl <- xtabs(~Phenotype+Cluster, ttclust)
    tbl
    #              Cluster
    # Phenotype      1  2
    #   Classical   32  6
    #   Mesenchymal 44 10
    #   Neural      26  0
    #   Proneural   45  8
    tbl.row <- prop.table(tbl, 1)
    barplot(t(tbl.row), beside=TRUE)
    

    [Figure: bar plot of the Cluster 1 and Cluster 2 proportions within each Phenotype]

    At this point, a simple proportions test gives no evidence that the percentage of Cluster 1 differs across the four Phenotypes (p = 0.15):

    prop.test(tbl)
    
    4-sample test for equality of proportions without continuity correction
    
    data:  tbl
    X-squared = 5.2908, df = 3, p-value = 0.1517
    alternative hypothesis: two.sided
    sample estimates:
       prop 1    prop 2    prop 3    prop 4 
    0.8421053 0.8148148 1.0000000 0.8490566 
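
    One cell of the table is zero, and the expected Cluster 2 count for Neural is below 5, so the chi-squared approximation behind prop.test may be rough here. As a hedged cross-check that is not part of the workflow above, Fisher's exact test can be run on the same table. In this sketch tbl is rebuilt from the printed counts so that the snippet runs on its own; the name expected is illustrative.

    ```r
    # Counts copied from the printed table above
    tbl <- matrix(c(32, 44, 26, 45, 6, 10, 0, 8), ncol = 2,
                  dimnames = list(Phenotype = c("Classical", "Mesenchymal", "Neural", "Proneural"),
                                  Cluster = c("1", "2")))

    # Expected counts under independence; values below 5 suggest caution
    expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)
    print(round(expected, 2))

    # Exact test of association between Phenotype and Cluster
    ft <- fisher.test(tbl)
    print(ft$p.value)
    ```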
    

    Using `prop.test` on each Phenotype separately indicates that the proportion in Cluster 1 is significantly different from 0.5 (that is, Cluster 1 and Cluster 2 differ) in every case:

    for(i in 1:4) print(prop.test(t(tbl[i, ])))
    
    # First test
    # 
    #   1-sample proportions test with continuity correction
    # 
    # data:  t(tbl[i, ]), null probability 0.5
    # X-squared = 16.447, df = 1, p-value = 5.002e-05
    # alternative hypothesis: true p is not equal to 0.5
    # 95 percent confidence interval:
    #  0.6807208 0.9341311
    # sample estimates:
    #         p 
    # 0.8421053 
        . . . .
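
    The question also asked for stars above the bars. Here is a minimal sketch, assuming the per-Phenotype one-sample tests above are the ones to annotate. The cutoffs (0.001, 0.01, 0.05) are a common convention rather than something taken from the question, and the names pvals, stars and bp are illustrative. tbl is rebuilt from the printed counts so that the snippet runs on its own.

    ```r
    # Counts copied from the printed table above
    tbl <- matrix(c(32, 44, 26, 45, 6, 10, 0, 8), ncol = 2,
                  dimnames = list(Phenotype = c("Classical", "Mesenchymal", "Neural", "Proneural"),
                                  Cluster = c("1", "2")))
    tbl.row <- prop.table(tbl, 1)

    # One-sample test per Phenotype: does the Cluster 1 share differ from 0.5?
    pvals <- sapply(seq_len(nrow(tbl)),
                    function(i) prop.test(tbl[i, 1], sum(tbl[i, ]))$p.value)
    names(pvals) <- rownames(tbl)

    # Map p-values to star labels (assumed cutoffs); "ns" = not significant
    stars <- as.character(cut(pvals, breaks = c(0, 0.001, 0.01, 0.05, 1),
                              labels = c("***", "**", "*", "ns"),
                              include.lowest = TRUE))

    # barplot() returns the bar midpoints: one column per Phenotype
    bp <- barplot(t(tbl.row), beside = TRUE, ylim = c(0, 1.3),
                  ylab = "Proportion of samples",
                  legend.text = paste("Cluster", colnames(tbl)),
                  args.legend = list(x = "topright", bty = "n"))

    # Put each label above the taller bar of its Phenotype group
    text(x = colMeans(bp), y = apply(tbl.row, 1, max) + 0.05, labels = stars)
    ```

    With four tests it may be worth adjusting the p-values first, for example with p.adjust(pvals, method = "holm"), before converting them to stars.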