I want a bar plot of the number of occurrences of each string in a particular column of a dataset in R.
At the same time, I want to run a t-test and mark the significant p-values with stars on top of the bars. Non-significant results can be labelled ns.
My attempt has been:
barplot(prop.table(table(ttcluster_dataset$Phenotype)),col=clustercolor,border="black",xlab="Phenotypes",ylab="Percentage of Samples expressed",main="Sample wise Phenotype distribution",cex.names = 0.8)
The dataset column is:
ttcluster_dataset$Phenotype<-
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Proneural (Cluster 1)", "Proneural (Cluster 2)", "Neural (Cluster 1)", "Neural (Cluster 2)",
"Classical (Cluster 1)", "Classical (Cluster 2)", "Mesenchymal (Cluster 1)",
"Mesenchymal (Cluster 2)"), class = "factor")
All suggestions would be appreciated.
A t-test is probably not what you want, since you are comparing counts and proportions between the two clusters. Your data are not yet set up for either analysis, so first we need to split the combined label into two variables:
Pheno.splt <- strsplit(as.character(ttcluster_dataset$Phenotype), " ")
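# Bind the split pieces into a matrix; keep the phenotype name (col 1) and the cluster number (col 3)
Pheno.mat <- do.call(rbind, Pheno.splt)[, c(1, 3)]
# Strip the trailing ")" from the cluster number; fixed=TRUE treats ")" as a literal character
ttclust <- data.frame(Phenotype=Pheno.mat[, 1], Cluster=gsub(")", "", Pheno.mat[, 2], fixed=TRUE))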
str(ttclust)
# 'data.frame': 171 obs. of 2 variables:
# $ Phenotype: chr "Proneural" "Proneural" "Proneural" "Proneural" ...
# $ Cluster : chr "1" "1" "1" "1" ...
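The same split can also be done with regular expressions instead of `strsplit()`. The following is only a sketch: `pheno` is a small hypothetical stand-in for `ttcluster_dataset$Phenotype`, and `ttclust_alt` is an illustrative name, not something from the original code.

```r
# Hypothetical stand-in for ttcluster_dataset$Phenotype (three example labels)
pheno <- factor(c("Proneural (Cluster 1)", "Neural (Cluster 1)",
                  "Mesenchymal (Cluster 2)"))
lab <- as.character(pheno)

ttclust_alt <- data.frame(
  # Drop the trailing " (Cluster N)" to keep only the phenotype name
  Phenotype = sub(" \\(Cluster [0-9]+\\)$", "", lab),
  # Keep only the digits captured inside "(Cluster N)"
  Cluster   = sub("^.*\\(Cluster ([0-9]+)\\)$", "\\1", lab),
  stringsAsFactors = FALSE
)
str(ttclust_alt)
```

This avoids relying on the position of the pieces after splitting, which may be more robust if a phenotype name ever contains a space.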
There are multiple ways to do this, but here we just split each Phenotype label
into three parts on the spaces and keep the first and third parts. Now ttclust
is a data frame with two variables, Phenotype and Cluster. Now a summary table and bar plot:
tbl <- xtabs(~Phenotype+Cluster, ttclust)
tbl
#              Cluster
# Phenotype      1  2
#   Classical   32  6
#   Mesenchymal 44 10
#   Neural      26  0
#   Proneural   45  8
tbl.row <- prop.table(tbl, 1)
barplot(t(tbl.row), beside=TRUE)
At this point, a simple proportions test indicates that there is no significant difference in the percentage of Cluster 1 across the four Phenotypes:
prop.test(tbl)
# 4-sample test for equality of proportions without continuity correction
#
# data: tbl
# X-squared = 5.2908, df = 3, p-value = 0.1517
# alternative hypothesis: two.sided
# sample estimates:
# prop 1 prop 2 prop 3 prop 4
# 0.8421053 0.8148148 1.0000000 0.8490566
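Because the Neural row has no Cluster 2 samples, some expected counts under the null hypothesis are small (about 3.6 for that cell), so the chi-squared approximation may be unreliable here. As a cross-check, Fisher's exact test can be run on the same table. This is a sketch only: the counts are retyped from the table above so the block runs on its own, and no result is claimed for it.

```r
# Counts retyped from the xtabs() output above (Phenotype x Cluster)
tbl <- matrix(c(32, 44, 26, 45,   # Cluster 1
                 6, 10,  0,  8),  # Cluster 2
              ncol = 2,
              dimnames = list(Phenotype = c("Classical", "Mesenchymal",
                                            "Neural", "Proneural"),
                              Cluster = c("1", "2")))

# Exact test of independence between Phenotype and Cluster
ft <- fisher.test(tbl)
ft$p.value
```

If the exact p-value leads to the same conclusion as `prop.test(tbl)`, the small expected counts are probably not a concern.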
Using `prop.test` on each Phenotype separately indicates that the proportion in Cluster 1 is significantly different from the proportion in Cluster 2 (that is, from 0.5) in every case:
for(i in 1:4) print(prop.test(t(tbl[i, ])))
# First test
#
# 1-sample proportions test with continuity correction
#
# data: t(tbl[i, ]), null probability 0.5
# X-squared = 16.447, df = 1, p-value = 5.002e-05
# alternative hypothesis: true p is not equal to 0.5
# 95 percent confidence interval:
# 0.6807208 0.9341311
# sample estimates:
# p
# 0.8421053
. . . .
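To put stars on top of the bars, as the question asks, one option is to collect the per-Phenotype p-values, convert them to star codes, and draw them with `text()` at the bar midpoints that `barplot()` returns. The following is a minimal sketch under some assumptions: it plots only the Cluster 1 percentage for each Phenotype, the star cut-offs (0.001, 0.01, 0.05) are the conventional ones rather than anything specified in the question, and `pvals`, `stars`, `pct` and `bp` are illustrative names. The counts are retyped from the table above so the block runs on its own.

```r
# Counts retyped from the xtabs() output above (Phenotype x Cluster)
tbl <- matrix(c(32, 44, 26, 45,   # Cluster 1
                 6, 10,  0,  8),  # Cluster 2
              ncol = 2,
              dimnames = list(Phenotype = c("Classical", "Mesenchymal",
                                            "Neural", "Proneural"),
                              Cluster = c("1", "2")))

# One-sample proportion test per Phenotype (null: Cluster 1 share is 0.5),
# equivalent to the prop.test() loop above
pvals <- apply(tbl, 1, function(r) prop.test(r[1], sum(r))$p.value)

# Convert p-values to star codes; anything above 0.05 is labelled "ns"
stars <- as.character(cut(pvals,
                          breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", "ns"),
                          include.lowest = TRUE))

# Percentage of each Phenotype's samples that fall in Cluster 1
pct <- 100 * prop.table(tbl, 1)[, "1"]

# barplot() returns the x positions of the bar midpoints
bp <- barplot(pct, ylim = c(0, 115), cex.names = 0.8,
              xlab = "Phenotypes", ylab = "Percent of samples in Cluster 1")

# Draw each label just above its bar
text(x = bp, y = pct, labels = stars, pos = 3)
```

With four tests on the same data, it may be worth adjusting the p-values (for example with `p.adjust()`) before converting them to stars.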