Here is a workable example of the data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overall cumulative frequency of z across x.
Unlike a typical CDF, the function should not reach 1.0 but should instead level off at sum(df1$z)/length(df1$z), i.e. a ymax of 0.36 (721/2000).
Using ggplot2, we can create a CDF of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But I want to extend this plot to show the cumulative percentage of z (as a function of x).
The end result should look like a CDF that levels off at that lower maximum instead of at 1.0.
EDIT
With some very clumsy data manipulation I am able to generate something similar to a CDF plot, but there must be a cleaner and easier method using dedicated packages and ggplot2:
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
  summarise(z_num = sum(z)) %>%
  mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796, so the maximal y value is 796/2000 = 0.398.
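If you would rather avoid the extra dependency, the same aggregation can be sketched in base R. This is only an illustration: the names z_num and plot_data_base are made up for this example, and the data are regenerated so the snippet runs on its own. Unlike the row index A in your edit, it keeps the actual x values, which matters if x does not start at 1 or has gaps.

```r
# Recreate the example data from the question
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- sort(round(x, digits = 0))
y <- rep(c(0.1, 0.25, 0.4, 0.85), each = 500)
z <- rbinom(n = 2000, size = 1, prob = y)
df1 <- data.frame(x, z)

# Number of z == 1 cases per distinct x value; tapply() returns a vector
# named by the x values, in increasing numeric order
z_num <- tapply(df1$z, df1$x, sum)

# plot_data_base is a hypothetical name for the base-R counterpart of plot_data
plot_data_base <- data.frame(
  x          = as.numeric(names(z_num)),
  z_num      = as.vector(z_num),
  cum_perc_z = cumsum(as.vector(z_num)) / nrow(df1)
)

# The curve should level off at the overall proportion of z == 1
stopifnot(isTRUE(all.equal(max(plot_data_base$cum_perc_z), mean(df1$z))))
```

The final stopifnot() is just a sanity check that the last cumulative value equals mean(df1$z), which is where the horizontal line belongs.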
For the plot, you can use geom_step() to draw a step function and add the horizontal line with geom_hline():
# linewidth replaces size for lines as of ggplot2 3.4.0; use size = 0.8 on older versions
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
  geom_step(colour = "red", linewidth = 0.8) +
  geom_hline(yintercept = max(plot_data$cum_perc_z))
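As a possible variation, you could skip the grouping step entirely: order the rows by x and accumulate z row by row. Tied x values then just stack into one vertical segment, so the step curve should look the same. The sketch below assumes that approach; df2 and p are names made up for the example, and the percent-formatted axis and dashed line are optional choices, not something the question requires.

```r
library(ggplot2)

# Recreate the example data from the question
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- sort(round(x, digits = 0))
y <- rep(c(0.1, 0.25, 0.4, 0.85), each = 500)
z <- rbinom(n = 2000, size = 1, prob = y)
df1 <- data.frame(x, z)

# Order by x, then take the running share of z == 1 over all rows
df2 <- df1[order(df1$x), ]
df2$cum_perc_z <- cumsum(df2$z) / nrow(df2)

# Step curve plus a dashed line at the overall proportion of z == 1
p <- ggplot(df2, aes(x = x, y = cum_perc_z)) +
  geom_step(colour = "red") +
  geom_hline(yintercept = mean(df2$z), linetype = "dashed") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1))
p
```

Fixing the y limits at 0 to 1 makes it easy to see how far the curve stays below a regular CDF.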