r ggplot2 cdf

Creating a nested CDF that doesn't reach 1


Here is a reproducible example of the data I wish to plot:

set.seed(123)

x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)

y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))

z <- rbinom(n=2000, size=1, prob=y)

df1 <- data.frame(x,z)

I want to plot the overall frequency of z across x.

Unlike a typical CDF, the function should not reach 1.0, but should instead top out at

sum(df1$z)/length(df1$z)

that is, a ymax of 0.36 (721/2000).
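
To make the target concrete, here is a minimal sketch on a tiny made-up data set. The helper name `cum_share_z` and the toy vectors `x_toy` and `z_toy` are illustrative assumptions, not part of the data above. For each threshold t it counts the observations with z = 1 and x <= t, then divides by the total number of observations, so the last value equals the overall share of z = 1 rather than 1.0.

```r
# Hypothetical helper (not from any package): share of ALL observations
# that have z = 1 and x <= t, evaluated for each threshold in t.
cum_share_z <- function(x, z, t) {
  sapply(t, function(ti) sum(z[x <= ti]) / length(z))
}

# Toy data, made up for illustration only.
x_toy <- c(1, 2, 2, 3, 5)
z_toy <- c(0, 1, 0, 1, 0)

cum_share_z(x_toy, z_toy, t = c(0, 2, 5))
# returns 0.0 0.2 0.4; the last value equals mean(z_toy)
```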

Using ggplot2, we can create a CDF of x with the following command:

library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()

But I want to extend this plot to show the cumulative percentage of z (as a function of 'x').

The end result should look like this:

[image: the desired plot]

EDIT

With some very crude data manipulation I am able to generate something similar to a CDF plot, but there must be a neater and easier method using ggplot2 and other packages.

mytable <- table(df1$x, df1$z)

mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)

mydf$sum <- cumsum(mydf$z_yes)
mydf$dis  <- mydf$sum/length(z)

plot(mydf$A, mydf$dis)

Solution

  • You can use the package dplyr to process the data as follows:

    library(dplyr)
    plot_data <- group_by(df1, x) %>%
                  summarise(z_num = sum(z)) %>%
                  mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
    

    This gives the same result as the data processing described in your edit. Note, however, that I get sum(df1$z) = 796, so the maximum y value is 796/2000 = 0.398.

    For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():

    ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
      geom_step(colour = "red", size = 0.8) +
      geom_hline(yintercept = max(plot_data$cum_perc_z))
    

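If you would rather avoid the dplyr dependency, the same summary can be sketched in base R with `tapply()`. This is only one possible variant, not the method used above. The block rebuilds `df1` from the question so that it runs on its own, checks that the curve ends at `mean(df1$z)`, and overlays the result on the ordinary ECDF of x for comparison. The grey and dashed styling are arbitrary choices.

```r
library(ggplot2)

# Rebuild the example data from the question.
set.seed(123)
x <- sort(round(rweibull(n = 2000, shape = 2, scale = 10)))
y <- rep(c(0.1, 0.25, 0.4, 0.85), each = 500)
z <- rbinom(n = 2000, size = 1, prob = y)
df1 <- data.frame(x, z)

# Sum z within each distinct x, accumulate, and divide by the total row count.
z_by_x <- tapply(df1$z, df1$x, sum)
plot_data <- data.frame(
  x = as.numeric(names(z_by_x)),
  cum_perc_z = as.numeric(cumsum(z_by_x)) / nrow(df1)
)

# Sanity check: the curve must end at the overall share of z = 1.
stopifnot(isTRUE(all.equal(max(plot_data$cum_perc_z), mean(df1$z))))

# Ordinary ECDF of x (grey) with the nested curve for z (red) on top.
p <- ggplot(df1, aes(x)) +
  stat_ecdf(colour = "grey50") +
  geom_step(data = plot_data, aes(x = x, y = cum_perc_z), colour = "red") +
  geom_hline(yintercept = mean(df1$z), linetype = "dashed")
print(p)
```

Using `mean(df1$z)` for the horizontal line is equivalent to `max(plot_data$cum_perc_z)` here, because the cumulative sum over all x values counts every z = 1 exactly once.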