
R plm thinks my number vector is a factor, why?


With this data input:

A                   B                C                   D
0.0513748973337     0.442624990365   0.044669941640565   12023787.0495
-0.047511808790502  0.199057057555   0.067542653775225   6674747.75598
0.250333519823608   0.0400359422093  -0.062361320324768  10836244.44
0.033600922318947   0.118359141703   0.048493523722074   7521473.94034
0.00492552770819    0.0851342003243  0.027123088894137   8742685.39098
0.02053037069955    0.0535545969759  0.06352586720282    8442677.4204
0.09050961131549    0.044871795257   0.049363888991624   7223126.70424
0.082789930841618   0.0230375009412  0.090676778601245   8974611.5623
0.06396481119371    0.0467280364963  0.128097065131764   8167179.81463

and this code:

library(plm);
mydata <- read.csv("reproduce_small.csv", sep = "\t");
plm(C ~ log(D), data = mydata, model = "pooling"); # works
plm(A ~ log(B), data = mydata, model = "pooling"); # error

the second plm call returns the following error:

Error in Math.factor(B) : ‘log’ not meaningful for factors

reproduce_small.csv contains the ten lines of data pasted above. Obviously, B is not a factor; it is clearly a numeric vector. So plm must be treating it as a factor. The question is "why?", but more importantly, "how do I fix this?"

Things I've tried:

#1) mydata$B.log <- log(mydata$B) results in

Error in model.frame.default(formula = y ~ X - 1, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'X')

which is weird in itself, since A and B.log clearly have the same length.

#2) plm(A ~ log(D), data = mydata, model = "pooling"); results in the same error as #1.

#3) plm(C ~ log(B), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).

#4) plm(A ~ log(B + 1), data = mydata, model = "pooling"); results in

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(B, 1) : ‘+’ not meaningful for factors

#5) plm(A ~ as.numeric(as.character(log(B))), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).

EDIT: As suggested, I'm including the result of str(mydata):

> str(mydata)
'data.frame':   9 obs. of  4 variables:
 $ A: num  0.05137 -0.04751 0.25033 0.0336 0.00493 ...
 $ B: num  0.4426 0.1991 0.04 0.1184 0.0851 ...
 $ C: num  0.0447 0.0675 -0.0624 0.0485 0.0271 ...
 $ D: num  12023787 6674748 10836244 7521474 8742685 ...

Also, trying mydata <- read.csv("reproduce_small.csv", sep = "\t", stringsAsFactors = FALSE); didn't help.


Solution

  • Helix123 pointed out in the comments that the data.frame should be converted to a pdata.frame. For instance, one solution for this toy example is:

    mydata$E <- c("x", "x", "x", "x", "x", "y", "y", "y", "y"); # Create E as an "index"
    mydata <- pdata.frame(mydata, index = "E"); # convert to pdata.frame
    plm(A ~ log(B), data = mydata, model = "pooling"); # now it works!
    

    EDIT: As for why this happens: as Helix123 pointed out in the comments, when plm is passed a plain data.frame instead of a pdata.frame, it quietly assumes that the first two columns are the indexes and converts them to factors under the hood. Here that means A and B, which is why the formulas involving A or B fail while C ~ log(D) works. plm then throws an unhelpful error instead of warning that the object passed is not of the expected type, or that it made an assumption at all.
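
The mechanism described above can be reproduced in base R without plm. The following is only a sketch: it imitates the described behaviour by hand (turning the first two columns into factors) rather than calling plm's internals. The name `as_plm_sees_it` is made up for illustration, and the data is a shortened, rounded copy of the question's first three rows.

```r
# Toy stand-in for the question's data (first three rows, values rounded).
mydata <- data.frame(
  A = c(0.0514, -0.0475, 0.2503),
  B = c(0.4426, 0.1991, 0.0400),
  C = c(0.0447, 0.0675, -0.0624),
  D = c(12023787, 6674748, 10836244)
)

# Assumption: imitate what plm is described as doing to a plain data.frame,
# i.e. treating the first two columns as indexes and storing them as factors.
as_plm_sees_it <- mydata
as_plm_sees_it[1:2] <- lapply(as_plm_sees_it[1:2], factor)

# A and B are now factors; C and D are still numeric.
col_classes <- sapply(as_plm_sees_it, class)
print(col_classes)

# log() on a factor stops with an error; capture the message instead of aborting.
log_B_error <- tryCatch(
  {
    log(as_plm_sees_it$B)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(log_B_error)

# log() on an untouched numeric column still works.
log_D <- log(as_plm_sees_it$D)
print(log_D)
```

Read this way, it is the position of A and B as the first two columns, not their values, that causes the trouble. That would explain why attempts #3 and #5 hit the same `log` error and why `stringsAsFactors = FALSE` made no difference, since the columns were numeric when they were read in.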