Search code examples
rmlogitnnet

Implementing mlogit with non 'choice' (land cover) data


I am trying to implement a multinomial logistic regression using mlogit with landcover change data. However, because mlogit seems to be built around “choice” data and examples are limited to such data, I’m having a hard time determining the feasibility of using this R package for my analysis. I am confident that a multinominal logistic regression is the statistical method suited for the dataset and question I am addressing, just not sure if mlogit has the capacity to work with the data.

Let us use this landcover change dataset as an example, structured similarly to my own:

dat.url<-'http://www.css.cornell.edu/faculty/dgr2/_static/files/R_ds/lcc.csv'
data<-read.csv(dat.url)
head(data)
  cov     dr   ds  t lspos text
1  CC 2924.3 4853 Nt     A   LC
2  CC 2535.2 5242 Nt     A   LC
3  CO 2146.1 4957 Ci     A   LC
4  NO 2442.3 4690 Ci     A   LC
5  NC 2831.4 5079 Ci     A   LC
6  OO 1576.9 4283 Nt     A   LC

The cov variable is a composite code that represents how the landcover has changed between two time periods: the first character is the landcover class at the initial time of measurement, and the second character is the landcover class at the end time. For example, CC represents a sample point where the landcover did not change – it was closed canopy forest in both time periods. Alternatively, CO represents a point where the landcover shifted from closed canopy in the initial time period to open canopy in the end time. The remaining variables are predictors of landcover change: dr – distance to road, ds – distance to settlement, t – land tenure class, lspos – landscape position class, text – soil texture class.

The concern that I have with mlogit is that this isn’t “choice” data in the way that most examples are. So, if I try to follow this exercise published by Train & Croissant, converting the data to mlogit’s long format results in this:

dat.mlog<-dfidx(data, varying=NULL, shape="wide", choice="cov")
head(dat.mlog)
~~~~~~~
 first 10 observations out of 9576 
~~~~~~~
     cov     dr   ds  t lspos text  idx
1   TRUE 2924.3 4853 Nt     A   LC 1:CC
2  FALSE 2924.3 4853 Nt     A   LC 1:CN
3  FALSE 2924.3 4853 Nt     A   LC 1:CO
4  FALSE 2924.3 4853 Nt     A   LC 1:NC
5  FALSE 2924.3 4853 Nt     A   LC 1:NN
6  FALSE 2924.3 4853 Nt     A   LC 1:NO
7  FALSE 2924.3 4853 Nt     A   LC 1:OC
8  FALSE 2924.3 4853 Nt     A   LC 1:ON
9  FALSE 2924.3 4853 Nt     A   LC 1:OO
10  TRUE 2535.2 5242 Nt     A   LC 2:CC

~~~ indexes ~~~~
   id1 id2
1    1  CC
2    1  CN
3    1  CO
4    1  NC
5    1  NN
6    1  NO
7    1  OC
8    1  ON
9    1  OO
10   2  CC
indexes:  1, 2 

So, it results in the same values for the predictor variables across all idx categories for each sample point – with the way it is formatted with dfidx(), my interpretation is that for each individual point the outcome of my predictors is the same for every landcover transition. This seems like it may be introducing problems into the analysis of this data. Although I haven’t been able to run the mlogit() model itself on this example dataset (another separate problem), when I run it on my own data the result is p-values = 1 for every predictor – seems a bit suspicious. Does anyone have insight on whether this is indeed the correct implementation of mlogit and dfidx() for this type of landcover data?

Note that there are a few published papers using mlogit with landcover data (though I have not found any supporting code to see how it was implemented), and although I can run these models with nnet/multinom, I would like to be able to use mlogit so I can incorporate a random effect term into my models. Any help would be greatly appreciated!


Solution

  • mlogit can handle this type of data. I think this should be a question about how to write formulas for mlogit.

    The formulas are written outcome ~ choice specific vars | chooser specific vars You want what the documentation calls a pure "multinomial model" which is achieved with call like: summary(mlogit(mode ~ 0 | income, data = Fish)) (this is in ?mlogit).

    > summary(mlogit(cov ~ 0| dr + ds, dat.mlog))
    
    Call:
    mlogit(formula = cov ~ 0 | dr + ds, data = dat.mlog, method = "nr")
    
    Frequencies of alternatives:choice
          CC       CN       CO       NC       NN       NO       OC       ON       OO 
    0.334586 0.262218 0.206767 0.015977 0.080827 0.042293 0.012218 0.014098 0.031015 
    
    nr method
    6 iterations, 0h:0m:1s 
    g'(-H)^-1g = 0.000677 
    successive function values within tolerance limits 
    
    Coefficients :
                      Estimate  Std. Error z-value  Pr(>|z|)    
    (Intercept):CN  6.0043e-01  1.5913e-01  3.7731 0.0001612 ***
    (Intercept):CO -4.0284e-02  1.6288e-01 -0.2473 0.8046562    
    (Intercept):NC -3.4319e+00  4.7527e-01 -7.2210 5.163e-13 ***
    (Intercept):NN -1.1944e+00  2.2550e-01 -5.2969 1.178e-07 ***
    (Intercept):NO -2.4593e+00  3.0252e-01 -8.1293 4.441e-16 ***
    (Intercept):OC -4.5496e+00  5.9764e-01 -7.6127 2.687e-14 ***
    (Intercept):ON -2.5914e+00  4.9472e-01 -5.2381 1.622e-07 ***
    (Intercept):OO -2.4502e+00  3.4192e-01 -7.1660 7.723e-13 ***
    dr:CN          -4.2871e-04  1.0852e-04 -3.9504 7.802e-05 ***
    dr:CO          -7.1969e-05  6.4000e-05 -1.1245 0.2607944    
    dr:NC          -9.9542e-07  1.4613e-04 -0.0068 0.9945650    
    dr:NN           8.1395e-06  7.5500e-05  0.1078 0.9141481    
    dr:NO           4.8619e-05  8.3781e-05  0.5803 0.5617037    
    dr:OC           1.4399e-04  1.0952e-04  1.3147 0.1886161    
    dr:ON           1.5983e-04  1.1730e-04  1.3626 0.1730170    
    dr:OO           7.3787e-05  9.3620e-05  0.7882 0.4306060    
    ds:CN          -3.0704e-04  1.1484e-04 -2.6735 0.0075059 ** 
    ds:CO          -2.3050e-04  1.0504e-04 -2.1944 0.0282063 *  
    ds:NC           2.1811e-04  2.4650e-04  0.8848 0.3762581    
    ds:NN          -1.4912e-04  1.3878e-04 -1.0745 0.2825894    
    ds:NO           1.7379e-04  1.5609e-04  1.1134 0.2655345    
    ds:OC           4.5264e-04  2.3766e-04  1.9046 0.0568347 .  
    ds:ON          -5.5640e-04  3.4013e-04 -1.6358 0.1018749    
    ds:OO          -2.6007e-05  1.9131e-04 -0.1359 0.8918706    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Log-Likelihood: -1726.7
    McFadden R^2:  0.027107 
    Likelihood ratio test : chisq = 96.218 (p.value = 1.7634e-13)
    

    If you do otherwise, you'll get pvalues of 1 and coefficients of 0.

    summary(mlogit(cov ~ dr + ds| 0, dat.mlog))
    
    Call:
    mlogit(formula = cov ~ dr + ds | 0, data = dat.mlog, method = "nr")
    
    Frequencies of alternatives:choice
          CC       CN       CO       NC       NN       NO       OC       ON       OO 
    0.334586 0.262218 0.206767 0.015977 0.080827 0.042293 0.012218 0.014098 0.031015 
    
    nr method
    1 iterations, 0h:0m:0s 
    g'(-H)^-1g = 1E+10 
    last step couldn't find higher value 
    
    Coefficients :
         Estimate Std. Error z-value Pr(>|z|)
    dr 0.0000e+00 9.9283e+10       0        1
    ds 0.0000e+00 1.1212e+11       0        1
    
    Log-Likelihood: -2337.8