Stata vs. R: Substantial difference in memory usage during pooled OLS (Panel)


I am attempting to run a pooled OLS regression on a panel dataset of about 34,000 observations. When I call lm() in R, the estimation takes forever and ends up consuming over 30 GB of memory, so it runs out of RAM before the regression finishes. In fact, I had to force-quit the program as my computer almost crashed.

When I run the exact same regression in Stata (on the same dataset), it takes roughly 1 second. I do not understand what is going on here. Am I doing something wrong?

R Code:

pooled1 <- lm(ret ~ l_ret + l_btm + l_roe, data = panel)

Stata Code:

reg ret l_ret l_btm l_roe, r

[Screenshots not reproduced: Stata output, R memory usage, Stata data browser, R data browser, str(Panel) output, summary(panel) output]


Solution

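  • Your l_ret variable is a character vector, so lm() treats it as a categorical predictor and estimates a separate dummy coefficient for each distinct value. With about 34,000 observations that can mean thousands of extra columns in the design matrix, which would explain the run time and memory use. Convert it to numeric with Panel$l_ret <- as.numeric(Panel$l_ret) and run your analysis again. Also, your data frame is a tibble. This should not slow R down, but to rule out any interference you can convert it to a plain data.frame with Panel <- as.data.frame(Panel).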
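The effect of a character predictor can be seen on a small simulated example. This is only a sketch: the `toy` data frame, its size, and its values are made up for illustration and are not the asker's panel; only the column names mirror the question.

```r
# Simulated stand-in for the panel (hypothetical data, 200 rows).
set.seed(1)
n <- 200
toy <- data.frame(
  ret   = rnorm(n),
  l_ret = as.character(rnorm(n)),  # numbers stored as text, as in the question
  l_btm = rnorm(n),
  l_roe = rnorm(n),
  stringsAsFactors = FALSE
)

# Number of distinct strings in the character column.
n_distinct <- length(unique(toy$l_ret))

# With a character predictor, the design matrix gets one dummy column per
# distinct string (minus the reference level), plus the intercept and the
# two numeric predictors.
X_chr <- model.matrix(ret ~ l_ret + l_btm + l_roe, data = toy)
ncol_chr <- ncol(X_chr)

# After conversion, l_ret contributes a single column.
toy$l_ret <- as.numeric(toy$l_ret)
X_num <- model.matrix(ret ~ l_ret + l_btm + l_roe, data = toy)
ncol_num <- ncol(X_num)

# Compare the column counts of the two design matrices.
c(character = ncol_chr, numeric = ncol_num)
```

By the same logic, if nearly all of the roughly 34,000 values of l_ret were distinct, a single copy of the design matrix would be close to 34,000 × 34,000 doubles, about 9 GB at 8 bytes each, before lm() makes any working copies. Note that as.numeric() returns NA (with a warning) for strings it cannot parse, so it is worth checking sum(is.na(Panel$l_ret)) after the conversion.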