Tags: r, statistics, na, fitdistrplus

How to exclude NAs? (fitdist function)


I have a 100x2 data frame, DFN. Running fitdist on the column DFN$LRet gives the error message "function mle failed to estimate the parameters, with the error code 100". I figured the reason is that the last row contains an NA, so I ran fitdist excluding NAs, and now I get the error "data must be a numeric vector of length greater than 1". Any thoughts on how to resolve this? Thanks very much.

DFN <- structure(list(LRet = c(0.0011, 0, -0.0026, 0, -0.0015, 0.0038, 3e-04, -0.0021, 4e-04, -0.001, 0, 0.0019, -6e-04, -8e-04, -5e-04, -8e-04, 3e-04, -5e-04, -0.0026, 0.0014, 7e-04, 0, -2e-04, 0.0011, -0.0025, 0.0042, 0.0022, -0.0017, -0.0058, 1e-04, 2e-04, 8e-04, -9e-04, -0.0014, -0.0014, -0.001, -0.0032, -0.0015, 6e-04, -8e-04, 0.001, -0.0014, -0.0017, -8e-04, -0.001, 0.0011, 0.0013, -0.001, 5e-04, 9e-04, -8e-04, -0.0025, 0.0027, 6e-04, 2e-04, -6e-04, 9e-04, -3e-04, -7e-04, 3e-04, 0, 2e-04, -6e-04, 1e-04, -1e-04, -7e-04, -8e-04, 7e-04, -1e-04, -7e-04, 7e-04, 8e-04, -8e-04, 8e-04, 0.0058, -1e-04, -5e-04, 0.0027, -0.0012, 7e-04, 7e-04, 0, 3e-04, -1e-04, 2e-04, -2e-04, -0.0013, -1e-04, 1e-04, -0.0011, 0.0013, 2e-04, -3e-04, -7e-04, 0, 0.0015, 1e-04, 3e-04, -0.0012, NA), LRetPct = c("0.11%", "0.00%", "-0.26%", "0.00%", "-0.15%", "0.38%", "0.03%", "-0.21%", "0.04%", "-0.10%", "0.00%", "0.19%", "-0.06%", "-0.08%", "-0.05%", "-0.08%", "0.03%", "-0.05%", "-0.26%", "0.14%", "0.07%", "0.00%", "-0.02%", "0.11%", "-0.25%", "0.42%", "0.22%", "-0.17%", "-0.58%", "0.01%", "0.02%", "0.08%", "-0.09%", "-0.14%", "-0.14%", "-0.10%", "-0.32%", "-0.15%", "0.06%", "-0.08%", "0.10%", "-0.14%", "-0.17%", "-0.08%", "-0.10%", "0.11%", "0.13%", "-0.10%", "0.05%", "0.09%", "-0.08%", "-0.25%", "0.27%", "0.06%", "0.02%", "-0.06%", "0.09%", "-0.03%", "-0.07%", "0.03%", "0.00%", "0.02%", "-0.06%", "0.01%", "-0.01%", "-0.07%", "-0.08%", "0.07%", "-0.01%", "-0.07%", "0.07%", "0.08%", "-0.08%", "0.08%", "0.58%", "-0.01%", "-0.05%", "0.27%", "-0.12%", "0.07%", "0.07%", "0.00%", "0.03%", "-0.01%", "0.02%", "-0.02%", "-0.13%", "-0.01%", "0.01%", "-0.11%", "0.13%", "0.02%", "-0.03%", "-0.07%", "0.00%", "0.15%", "0.01%", "0.03%", "-0.12%", " NA%")), .Names = c("LRet", "LRetPct"), class = "data.frame", row.names = 901:1000)

library(fitdistrplus)

#Following gives error code 100
f1 <- fitdist(DFN$LRet,"norm") 

#Following gives error code 100
f1 <- fitdist(DFN$LRet,"norm", na.rm=T)

#Following gives error "data must be a numeric vector of length greater than 1"
f1 <- fitdist(na.exclude(DFN$LRet),"norm")
#Same result using na.omit

Please note that if I eliminate the last row, which contains the NA, the above code works fine. I would rather not have to eliminate the last row before running fitdist if that can be avoided.

EDIT/UPDATE: Eliminating the last row with the NA did solve the issue at first, but I can no longer reproduce that consistently (i.e. the code has run successfully a few times after eliminating the last row, but not always). I am trying to understand why. I have tried a 25x2, a 100x2, and a 300x2 data frame, as well as a vector, with similar results. I had thought the size of the data frame or vector might be part of the problem, hence the trials with different sizes.


Solution

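  • (The "data must be a numeric vector" error appears to come from a poorly written is.vector check in the fitdist code: is.vector() is FALSE for the result of na.exclude() or na.omit(), because those attach an na.action attribute. Wrapping the result in c() strips the attribute, but that alone did not solve the errors.) The fitdist function seems to have difficulty with vectors of small variance: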

    > var(na.exclude(DFN$LRet))
    [1] 2.220427e-06
    

    You can get around that by multiplying by 10:

    > f1 <- fitdist(10*c(na.exclude(DFN$LRet)),"norm")
    > f1
    Fitting of the distribution ' norm ' by maximum likelihood 
    Parameters:
              estimate  Std. Error
    mean -0.0009090909 0.001490034
    sd    0.0148256472 0.001032122
    

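    Standard probability theory lets you then correct those estimates: divide by 10 for the mean and by 100 for the variance (or by 10 for the sd). The corrected fitdist estimates are reasonably close to the sample values. The small remaining difference is expected, since maximum likelihood divides by n while sd() divides by n - 1, a factor of sqrt(99/98) for the 99 non-missing values: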

    > all.equal( 0.0148256472/10 , sd(na.exclude(DFN$LRet) ) )
    [1] "Mean relative difference: 0.005089095"
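
The two points above (getting a plain vector without the NA, and undoing the scale factor afterwards) can be checked in base R without fitdistrplus. This is only a sketch under stated assumptions: `lret` is a short illustrative subset of the question's LRet column, and `normal_mle` is a hypothetical helper that uses the closed-form normal MLE as a stand-in for what `fitdist(..., "norm")` estimates numerically.

```r
# Illustrative subset of the question's LRet column, ending in the NA.
lret <- c(0.0011, 0, -0.0026, 0, -0.0015, 0.0038, 3e-04, -0.0021, 4e-04, -0.001, NA)

# na.omit()/na.exclude() attach an "na.action" attribute,
# so the result no longer passes is.vector().
omitted <- na.omit(lret)
is.vector(omitted)       # FALSE

# c() drops that attribute, as in the answer's call;
# logical indexing never adds it in the first place.
is.vector(c(omitted))    # TRUE
clean <- lret[!is.na(lret)]
is.vector(clean)         # TRUE

# Hypothetical helper: closed-form normal MLE (divides by n, not n - 1).
normal_mle <- function(x) {
  m <- mean(x)
  c(mean = m, sd = sqrt(mean((x - m)^2)))
}

k <- 10                        # scale factor used in the answer
scaled_fit <- normal_mle(k * clean)
rescaled <- scaled_fit / k     # mean and sd both scale linearly with k

# The MLE sd and the sample sd differ only by the factor sqrt(n / (n - 1)).
n <- length(clean)
all.equal(unname(rescaled["sd"]) * sqrt(n / (n - 1)), sd(clean))
```

With the real data you would pass `10 * clean` to `fitdist` instead of `normal_mle` and divide both estimates (and their standard errors) by 10 afterwards. Whether a factor of 10 is enough for other data sets is not something this sketch establishes.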