I am trying to run a generalized linear model on a very large dataset (several million rows). R doesn't seem able to handle the analysis, however, as I keep getting memory allocation errors (unable to allocate vector of size...etc.).
The data fit in RAM, but seem to be too large to estimate complex models. As a solution, I'm exploring using the ff package to replace r's in-RAM storage mechanism with on-disk storage.
I have successfully (I think) off-loaded the data to my hard drive, but when I attempt to estimate the glm (via the biglm package) I get the following error:
Error: $ operator is invalid for atomic vectors
I'm not sure why I'm getting this specific error when I use the bigglm function. When I run the glm on the full dataset, it doesn't give me this specific error, though perhaps r is running out of memory before it gets far enough for the "operator is invalid" error to trigger.
I've provided an example data set and code below. Note that the standard glm runs just fine on this sample data. The problem arises when using biglm.
Please let me know if you have any questions.
Thank you in advance!
#Load required packages
library(readr)
library(ff)
library(ffbase)
library(LaF)
library(biglm)
#Create sample data
df <- data.frame("id" = as.character(1:20), "group" = rep(seq(1:5), 4),
"x1" = as.character(rep(c("a", "b", "c", "d"), 5)),
"x2" = rnorm(20, 50, 1), y = sample(0:1, 20, replace=T),
stringsAsFactors = FALSE)
#Write data to file
write_csv(df, "df.csv")
#Create connection to sample data using laf
con <- laf_open_csv(filename = "df.csv",
column_types = c("string", "string", "string",
"double", "string"),
column_names = c("id", "group", "x1", "x2", "y"),
skip = 1)
#Use LaF to import data into ffdf object
ff <- laf_to_ffdf(laf = con)
#Fit glm on data stored in RAM (note this model runs fine)
fit.glm <- glm(y ~ factor(x1) + x2 + factor(group), data=df,
family="binomial")
#Fit glm on data stored on hard-drive (note this model fails)
fit.big <- bigglm(y ~ factor(x1) + x2 + factor(group), data=ff,
family="binomial")
You are using the wrong family argument.
library(ffbase)
library(biglm)
df <- data.frame("id" = factor(as.character(1:20)), "group" = factor(rep(seq(1:5), 4)),
"x1" = factor(as.character(rep(c("a", "b", "c", "d"), 5))),
"x2" = rnorm(20, 50, 1), y = sample(0:1, 20, replace=T),
stringsAsFactors = FALSE)
d <- as.ffdf(df)
fit.big <- bigglm.ffdf(y ~ x1 + x2 , data = d,
family = binomial(link = "logit"), chunksize = 3)