Fitting exponential distribution to frequency table

I have the following data set:

intervals <- c("0-10", "10-20", "20-30", "30-40", "40-50", "50-75", "75-100", ">100")
int.mean <- c(5.5, 14.3, 24.9, 35.4, 45.2, 63.1, 86.1, NA)
freq <- c(165, 90, 55, 25, 20, 35, 30, 15)

data <- data.frame(intervals, int.mean, freq)

I am looking to fit an exponential distribution to the data to predict the probability a value exceed 150 with a certain degree of confidence. I can fit the distribution as follows:

library(MASS)
fittedexp <- fitdistr(na.exclude(data$int.mean), "exponential")

However that doesn't take into account frequencies so I am not sure I am doing this correctly. I then plan to use to the optim function to create the confidence interval for the estimated probability.

Solution

You are dealing with a categorical variable, "intervals", which creates a discrete observation of counts based on a presumed underlying continuous variable from which you have taken breakpoints. Kind of messy data situation. Technically you have interval-censored data. However if you have exponential distributions as an assumption, those "means" you calculated are actually midpoints, but they would not be expected to be the means of an exponentially distributed variable. See below for my revised comments about the int.means observations. (So now I'll expand my original comment to include some R code.)

If we take the endpoints of your intervals as a breaks variable, and also calculate the proportions in the observed data we have:

 brks <- c(0, 10,20,30,40,50,75,100,Inf)
 freq <- c(165, 90, 55, 25, 20, 35, 30, 15)
 prop<- freq/sum(freq)
 prop
#-----
[1] 0.37931034 0.20689655 0.12643678 0.05747126 0.04597701 0.08045977 0.06896552 0.03448276
round(prop,2)
[1] 0.38 0.21 0.13 0.06 0.05 0.08 0.07 0.03

We can then show what an exponentially distributed variable with a similar mean might "look like" (as far as proportions) if binned into those intervals:

 table( findInterval( rexp(100, 1/15), brks) )/100

   1    2    3    4    5    6    7 
0.47 0.24 0.12 0.08 0.04 0.04 0.01

So we might want to try a mean that is higher than 15, say 20?

> table( findInterval( rexp(100, 1/20), brks) )/100

   1    2    3    4    5    6    7    8 
0.37 0.24 0.13 0.09 0.07 0.07 0.02 0.01 
> round(prop,2)
[1] 0.38 0.21 0.13 0.06 0.05 0.08 0.07 0.03

So you can fit the low end of the observations well, but an exponentially distributed variable seems to have a somewhat "thinner" tail. Since your interest is in the high end of the data, you may want to get a better fit at the higher end, but this will mess with your goal of a statistically principled confidence interval. You're kind of stuck, since your data isn't a properly "exponential" set of observations. (Increased size of simulation to 1000 to reduce impact of noise.)

> table( findInterval( rexp(1000, 1/25), brks) )/1000

    1     2     3     4     5     6     7     8 
0.329 0.222 0.141 0.103 0.056 0.094 0.034 0.021 
> round(prop,2)
[1] 0.38 0.21 0.13 0.06 0.05 0.08 0.07 0.03

The fit there doesn't look terrible. If the rate parameter of an exponential distribution were 1/25, then this would be the proportion of observations greater than 150:

 1-pexp(150, 1/25)
#[1] 0.002478752

Possibly useful: http://jsdajournal.springeropen.com/articles/10.1186/s40488-015-0028-6

You also can try searching on CrossValidated.com where some prior discussions exist.

Edit: I originally thought those int.means values were midpoints of the interval boundaries, but that's clearly not the case, since they seem to be close to what would be the midpoints but have a significant amount of jitter around the midpoints. Furthermore, those values are not consistent with an exponential distribution, since in the most populous interval (0-10) the observations should be to the "left" of the midpoint and it is not even on the left hand of the midpoint. It should probably be 4.0 or 4.5, but surely not as high as 5.5. It suggests some other distribution is underlying this physical process, perhaps some sort of Gamma distribution that would fall to zero near zero but peak early in the 0-10 interval and then have a longer tail.