Search code examples
rhistogramcurve-fittingnon-linear-regression

calculate gaussian curve fitting on a list


I have a list data like below. I want to perform nonlinear regression Gaussian curve fitting between mids and counts for each element of my list and report mean and standard deviation

mylist<- structure(list(A = structure(list(breaks = c(-10, -9, 
-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4), counts = c(1L, 
0L, 1L, 5L, 9L, 38L, 56L, 105L, 529L, 2858L, 17L, 2L, 0L, 2L), 
    density = c(0.000276014352746343, 0, 0.000276014352746343, 
    0.00138007176373171, 0.00248412917471709, 0.010488545404361, 
    0.0154568037537952, 0.028981507038366, 0.146011592602815, 
    0.788849020149048, 0.00469224399668783, 0.000552028705492686, 
    0, 0.000552028705492686), mids = c(-9.5, -8.5, -7.5, -6.5, 
    -5.5, -4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5), 
    xname = "x", equidist = TRUE), .Names = c("breaks", "counts", 
"density", "mids", "xname", "equidist"), class = "histogram"), 
    B = structure(list(breaks = c(-7, -6, -5, 
    -4, -3, -2, -1, 0), counts = c(2L, 0L, 6L, 2L, 2L, 1L, 3L
    ), density = c(0.125, 0, 0.375, 0.125, 0.125, 0.0625, 0.1875
    ), mids = c(-6.5, -5.5, -4.5, -3.5, -2.5, -1.5, -0.5), xname = "x", 
        equidist = TRUE), .Names = c("breaks", "counts", "density", 
    "mids", "xname", "equidist"), class = "histogram"), C = structure(list(
        breaks = c(-7, -6, -5, -4, -3, -2, -1, 0, 1), counts = c(2L, 
        2L, 4L, 5L, 14L, 22L, 110L, 3L), density = c(0.0123456790123457, 
        0.0123456790123457, 0.0246913580246914, 0.0308641975308642, 
        0.0864197530864197, 0.135802469135802, 0.679012345679012, 
        0.0185185185185185), mids = c(-6.5, -5.5, -4.5, -3.5, 
        -2.5, -1.5, -0.5, 0.5), xname = "x", equidist = TRUE), .Names = c("breaks", 
    "counts", "density", "mids", "xname", "equidist"), class = "histogram")), .Names = c("A", 
"B", "C"))

I have read this Fitting a density curve to a histogram in R but this is how to fit a curve to a histogram. what I want is Best-fit values"

" Mean" " SD"

If I use PRISM to do it, I should get the following results for A

Mids   Counts
-9.5    1
-8.5    0
-7.5    1
-6.5    5
-5.5    9
-4.5    38
-3.5    56
-2.5    105
-1.5    529
-0.5    2858
0.5     17
1.5     2
2.5     0
3.5     2

performing nonlinear regression Gaussian curve fitting , I get

"Best-fit values"   
"     Amplitude"    3537
"     Mean"       -0.751
"     SD"         0.3842

for the second set B

Mids   Counts
-6.5    2
-5.5    0
-4.5    6
-3.5    2
-2.5    2
-1.5    1
-0.5    3



"Best-fit values"   
"     Amplitude"    7.672
"     Mean"         -4.2
"     SD"          0.4275

and for the third one

Mids   Counts
-6.5    2
-5.5    2
-4.5    4
-3.5    5
-2.5    14
-1.5    22
-0.5    110
0.5      3

I get this

"Best-fit values"   
"     Amplitude"    120.7
"     Mean"       -0.6893
"     SD"        0.4397

Solution

  • In order to convert the histogram back to the estimate of the mean and standard deviation. First convert the results of the bin counts times the bin. This will be an approximation of the original data.

    Based on your example above:

    #extract the mid points and create list of simulated data
    simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
    #if the original data were integers then this may give a better estimate
    #simdata<-lapply(mylist, function(x){rep(x$breaks[-1], x$counts)})
    
    #find the mean and sd of simulated data
    means<-lapply(simdata, mean)
    sds<-lapply(simdata, sd)
    #or use sapply in the above 2 lines depending on future process needs
    

    If your data was integers then using the breaks as the bins will provide a better estimate. Depending on the function for the histogram (ie right=TRUE/FALSE) may shift the results by one.

    Edit

    I thought this was going to be an easy one. I reviewed the video, the sample data shown was:

    mids<-seq(-7, 7)
    counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
    simdata<-rep(mids, counts)
    

    The video results were mean = -0.7359 and sd= 0.4571. The solution which I found provided the closest results was using the "fitdistrplus" package:

    fitdist(simdata, "norm", "mge")
    

    Using the "maximizing goodness-of-fit estimation" resulted in mean = -0.7597280 and sd= 0.8320465.
    At this point, the method above provides a close estimate but does not exactly match. I don't not know what technique was used to calculate the fit from the video.

    Edit #2

    The above solutions involved recreating the original data and fitting that using either the mean/sd or using the fitdistrplus package. This attempt is an attempt to perform a least-square fit using the Gaussian distribution.

    simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
    means<-sapply(simdata, mean)
    sds<-sapply(simdata, sd)
    
    #Data from video
    #mids<-seq(-7, 7)
    #counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
    
    #make list of the bins and distribution in each bin
    mids<-lapply(mylist, function(x){x$mids})
    dis<-lapply(mylist, function(x) {x$counts/sum(x$counts)})
    
    #function to perform the least square fit
    nnorm<-function(values, mids, dis) {
      means<-values[1]
      sds<-values[2]
      #print(paste(means, sds))
      #calculate out the Gaussian distribution for each bin
      modeld<-dnorm(mids, means, sds)  
      #sum of the squares
      diff<-sum( (modeld-dis)^2)
      diff
    }
    
    #use optim function with the mean and sd as initial guesses
    #find the mininium with the mean and SD as fit parameters
    lapply(1:3, function(i) {optim(c(means[[i]], sds[[i]]), nnorm, mids=mids[[i]], dis=dis[[i]])})
    

    This solution provides a closer answer to PRISM results, but still not the same. Here is a comparison of all the 4 solutions. enter image description here

    From the table, the least square fit (the one just above) provides the closest approximation. Maybe tweaking the mid points dnorm function might help. But Case B data is farthest from being normally distributed but the PRISM software still generates a small standard deviation, while the other methods are similar. It is possible the PRISM software performs some type of data filtering to remove the outliers before the fit.