Tags: r, distribution, probability-density, probability-distribution

Cumulative Distribution Function from input histogram


I would like to build the Cumulative Distribution Function (CDF) from an input file that contains the data used to generate a histogram. The input file has one column with the bins and one column with the number of occurrences in each bin, so it looks like this:

bin     column6
0       1189
5       11957
10      24203
15      21518
20      14515
25      10323
30      7799
35      6015
40      4869
45      3858
50      3215
55      2615
60      2350
65      1890
70      1673
75      1433
80      1218
85      942
90      869
95      736
100     605
105     528
110     449
115     429
120     327
125     252
130     208
135     170
140     154
145     138
150     124
155     86
160     113
165     108
170     71
175     72
180     51
185     58
190     37
195     29
200     35
205     24
210     11
215     24
220     16
225     20
230     15
235     5
240     11
245     4
250     4
255     6
260     6
265     6
270     4
275     3
280     4
285     2
290     3
295     1
300     5
305     3
310     2
315     1
320     1
325     2
330     0
335     1
340     2
345     0
350     0
355     2
360     4
365     2
370     0
375     1
380     1
385     2
390     0
395     1
400     1
405     1

I use R to visualize the histogram using the following code:

library(ggplot2)

input <- read.table('/home/agalvez/data/domains/histo_leu.txt', sep="\t", header=TRUE)

histo <- ggplot(data=input, aes(x=bin, y=column6)) +
  geom_bar(stat="identity")
 
histo

Could someone give me some advice on how to build the CDF for this histogram? Thanks in advance!


Solution

  • The question is a bit unclear, but I assume you are looking for the empirical CDF (eCDF), since a parametric CDF generally has an analytical formula.

    In R, you can use ecdf to generate an eCDF.

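    library(purrr)
    library(tidyr)
    library(dplyr)
    library(ggplot2)
    # Expand each bin into one row per occurrence so ecdf() sees the raw sample
    input <- input %>%
        filter(column6 != 0) %>%
        mutate(
            column6 = map(column6, ~1:.x)
        ) %>%
        unnest(column6)
    # Make the ecdf (a step function that can be evaluated, e.g. bin_ecdf(50))
    bin_ecdf <- ecdf(input$bin)
    # To plot use stat_ecdf
    input %>%
        ggplot(aes(bin)) +
        stat_ecdf(geom = "step")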
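
If expanding the counts into one row per occurrence is not wanted, the same cumulative fractions can probably be obtained directly from the counts with a running sum. The following is only a minimal sketch, not part of the original answer. It assumes that each `bin` value labels its bin and that the CDF at a bin should include that bin's own count. The data frame name `histo_counts` and the column `cdf` are made up for illustration, and the toy data is just the first five rows of the table in the question.

```r
library(ggplot2)

# Hypothetical toy data: the first five rows of the question's table
histo_counts <- data.frame(
  bin     = c(0, 5, 10, 15, 20),
  column6 = c(1189, 11957, 24203, 21518, 14515)
)

# Running total of the counts divided by the grand total:
# the fraction of observations in this bin or any earlier bin
histo_counts$cdf <- cumsum(histo_counts$column6) / sum(histo_counts$column6)

# Step plot of the cumulative fraction against the bin
cdf_plot <- ggplot(histo_counts, aes(x = bin, y = cdf)) +
  geom_step() +
  labs(x = "bin", y = "cumulative fraction")
cdf_plot
```

On the full table the approach is the same: read the file as in the question and apply `cumsum()` to `column6`. The last value of `cdf` is 1 by construction, and empty bins simply repeat the previous value.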