I would like to build the cumulative distribution function (CDF) from an input file that contains the data used to generate a histogram. The input file has one column with the bins and one column with the number of occurrences in each bin, so it looks like this:
bin column6
0 1189
5 11957
10 24203
15 21518
20 14515
25 10323
30 7799
35 6015
40 4869
45 3858
50 3215
55 2615
60 2350
65 1890
70 1673
75 1433
80 1218
85 942
90 869
95 736
100 605
105 528
110 449
115 429
120 327
125 252
130 208
135 170
140 154
145 138
150 124
155 86
160 113
165 108
170 71
175 72
180 51
185 58
190 37
195 29
200 35
205 24
210 11
215 24
220 16
225 20
230 15
235 5
240 11
245 4
250 4
255 6
260 6
265 6
270 4
275 3
280 4
285 2
290 3
295 1
300 5
305 3
310 2
315 1
320 1
325 2
330 0
335 1
340 2
345 0
350 0
355 2
360 4
365 2
370 0
375 1
380 1
385 2
390 0
395 1
400 1
405 1
I use R to visualize the histogram using the following code:
library(ggplot2)
input <- read.table('/home/agalvez/data/domains/histo_leu.txt', sep="\t", header=TRUE)
histo <- ggplot(data = input, aes(x = bin, y = column6)) +
  geom_bar(stat = "identity")
histo
Could someone give me some advice on how to build the CDF for this histogram? Thanks in advance!
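The question is a bit unclear, but I assume you are looking for the empirical CDF (eCDF), since a parametric CDF generally has an analytical formula.
In R, you can use ecdf() to generate an eCDF. It expects one value per observation, so the binned counts first have to be expanded into one row per occurrence.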
library(purrr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(magrittr)  # provides the %$% pipe used below
# Expand each bin into one row per occurrence (empty bins are dropped)
input <- input %>%
  filter(column6 != 0) %>%
  mutate(
    column6 = map(column6, ~1:.x)
  ) %>%
  unnest(column6)
# Make the ecdf
input %$%
  ecdf(bin)
# To plot use stat_ecdf
input %>%
  ggplot(aes(bin)) +
  stat_ecdf(geom = "step")
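Since the file already holds counts per bin, the same cumulative proportions can also be obtained without expanding the rows, by taking a running sum of the counts and dividing by the total. The following is only a minimal sketch in base R: it uses the first five bins from the question as an illustrative subset (so the values are not those of the full data set), and the column name cdf is an assumption introduced here.

```r
# Illustrative subset: the first five bins from the question's table.
# With the real data, `input` would come from read.table() as in the question.
input <- data.frame(
  bin     = c(0, 5, 10, 15, 20),
  column6 = c(1189, 11957, 24203, 21518, 14515)
)

# Cumulative proportion of occurrences up to and including each bin.
# `cdf` is a hypothetical column name chosen for this sketch.
input$cdf <- cumsum(input$column6) / sum(input$column6)

print(input)
```

The resulting column can be drawn as a step curve, for example with ggplot(input, aes(bin, cdf)) + geom_step(). This assumes the rows are sorted by bin, as they are in the question's file.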