Tags: r, non-linear-regression, mixture-model, gmm

Data is too long Error in R FlexmixNL package


I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:

n = 10000
x1 <- runif(n,0,100) 
x2 <- runif(n,0,100) 
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(y1, y2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)

modelNL <- flexmix(y~x, k =2, 
                   model = FLXMRnlm(formula = y ~ a*x/(b+x), 
                                    family = "gaussian", 
                                    start = list(start1, start2))) 

plot(x, y, col = clusters(modelNL))

and before the plot, it gives me this error:

Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long

I checked google for similar errors, but I don't quite understand what is wrong with my own code that results in this error.

As you can already tell, I am very new to R, so please explain it in the most layman terms possible. Thank you in advance.


Solution

  • Ironically, given an error message saying the data is "too long", I think the proximate cause of that error is that there is no data argument. If you supply the data as a data frame you can still get an error, but it is not the same one you are seeing. When you plot the data, you get a rather bizarre set of values, at least from a statistical-distribution standpoint, and it is not clear why you are trying to model them with this formula. Nonetheless, with those starting values and a data frame passed to data, the fit does produce results:

    > modelNL <- flexmix(y~x, k =2,  data=data.frame(x=x,y=y),
    +                    model = FLXMRnlm(formula = y ~ a*x/(b+x), 
    +                                     family = "gaussian", 
    +                                     start = list(start1, start2)))
    > modelNL
    
    Call:
    flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
        a * x/(b + x), family = "gaussian", start = list(start1, start2)))
    
    Cluster sizes:
        1     2 
     6664 13336 
    
    convergence after 20 iterations
    > summary(modelNL)
    
    Call:
    flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
        a * x/(b + x), family = "gaussian", start = list(start1, start2)))
    
           prior  size post>0 ratio
    Comp.1 0.436  6664  20000 0.333
    Comp.2 0.564 13336  16306 0.818
    
    'log Lik.' -91417.03 (df=7)
    AIC: 182848.1   BIC: 182903.4 
    

    Most R regression functions first look up the names used in a formula in the data= argument. Apparently this function fails when it has to fall back to the global environment to resolve them.
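
    To illustrate both points, here is a minimal base-R sketch. It does not use flexmix at all; the toy values and the names df, slope_global, slope_df and err_msg are made up for the example. It first shows the usual lookup order with lm(), then triggers an error from matrix() directly by asking for zero rows. My reading, which is an assumption about flexmix internals rather than something I have verified, is that sum(groups$groupfirst) came out as 0 because no observations were found.

    ```r
    # Two different data sets share the names x and y: one in the global
    # environment, one inside a data frame.
    x <- 1:5
    y <- 2 * x                                        # global y rises with x
    df <- data.frame(x = 1:5, y = c(10, 8, 6, 4, 2))  # df$y falls with x

    # Without data=, lm() resolves x and y in the formula's environment.
    slope_global <- coef(lm(y ~ x))[["x"]]

    # With data=, the columns of df take precedence over the globals.
    slope_df <- coef(lm(y ~ x, data = df))[["x"]]

    print(c(global = slope_global, data_frame = slope_df))

    # Ask base R's matrix() for zero rows while giving it one value to
    # place, and capture whatever error message it raises.
    err_msg <- tryCatch(
      {
        matrix(1, nrow = 0)
        NA_character_
      },
      error = function(e) conditionMessage(e)
    )
    print(err_msg)
    ```

    If that reading is right, the message is really about a matrix with zero rows, not about your data being too large.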

    I then tried a formula suggested by the plot of the data and got convergent results:

    > modelNL <- flexmix(y~x, k =2,  data=data.frame(x=x,y=y),
    +                    model = FLXMRnlm(formula = y ~ a*x*cos(x+b), 
    +                                     family = "gaussian", 
    +                                     start = list(start1, start2)))
    > modelNL
    
    Call:
    flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
        a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
    
    Cluster sizes:
        1     2 
     9395 10605 
    
    convergence after 17 iterations
    > summary(modelNL)
    
    Call:
    flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
        a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
    
           prior  size post>0 ratio
    Comp.1 0.521  9395  18009 0.522
    Comp.2 0.479 10605  13378 0.793
    
    'log Lik.' -78659.85 (df=7)
    AIC: 157333.7   BIC: 157389 
    

    The reduction in AIC compared to the first formula (from about 182848 to about 157334) is huge.
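
    As a sanity check on those numbers, the AIC and BIC can be recomputed by hand from the reported log-likelihoods. This is only a sketch using the standard definitions AIC = -2 logLik + 2 df and BIC = -2 logLik + log(n) df. The variable names are mine, the values are copied from the two summaries above, and n is taken to be the 20000 observations implied by the cluster sizes.

    ```r
    # Log-likelihoods copied from the two summary() outputs.
    loglik <- c(first = -91417.03, second = -78659.85)
    df_par <- 7       # degrees of freedom reported for both fits
    n_obs <- 20000    # 6664 + 13336 observations

    # Standard information criteria; smaller is better.
    aic <- -2 * loglik + 2 * df_par
    bic <- -2 * loglik + log(n_obs) * df_par

    print(round(aic, 1))
    print(round(bic, 1))

    # Drop in AIC when moving from the first formula to the second.
    aic_drop <- aic[["first"]] - aic[["second"]]
    print(round(aic_drop, 1))
    ```

    Both fits have the same number of parameters and use the same data, so the whole difference comes from the log-likelihood.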