I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows by the number of columns gives ~2.7 billion, which exceeds R's integer limit (2,147,483,647). I can't provide the data, but the following examples recreate the issue.
library(speedglm)
# large example that works
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
working.example <- speedglm(fo, data = da, family = Gamma(log))
# repeat with large enough size to break
k <- 5000 # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
failed.example <- speedglm(fo, data = da, family = Gamma(log))
# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)
This causes an error and an integer overflow warning:
Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow
I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.
Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.
My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?
Running: R version 3.3.3 (2017-03-06)
Actual code below:
dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0",
             "ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650",
             "FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex",
             "Row", "Modular", "Rural", "FirstTimeBuyer",
             "FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio",
             "VintageBefore2001", "NFLD", "yoy.fcpwti:province_n")
logit1 <- speedglm(formula = paste("DefaultFlag ~ ",
                                   paste(dft_var, collapse = "+"),
                                   sep = ""),
                   family = binomial(logit),
                   na.action = na.exclude,
                   data = default.data,
                   chunksize = 1*10^7)
Based on my investigation below, @James figured out that the problem can be avoided by providing a non-NULL value for the parameter sparse in the call of the speedglm function, as it prevents the internal call of the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
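One way to tell in advance whether a fit is likely to hit this problem is to compute the number of cells in double precision and compare it with the integer limit. The following is only a minimal sketch: the helper name exceeds_int_limit is made up for illustration and is not part of speedglm, and the row count used for the real data is the approximate figure from the question.

```r
# Hypothetical helper (not part of speedglm): TRUE when n_rows * n_cols
# cannot be represented as a 32-bit R integer.
exceeds_int_limit <- function(n_rows, n_cols) {
  # as.numeric() forces double arithmetic, so the product does not overflow here
  as.numeric(n_rows) * as.numeric(n_cols) > .Machine$integer.max
}

small_case <- exceeds_int_limit(500000L, 500L)    # working example, 2.5e8 cells
large_case <- exceeds_int_limit(500000L, 5000L)   # failing example, 2.5e9 cells
real_case  <- exceeds_int_limit(69.88e6, 38)      # approximate size of the real data

print(c(small = small_case, large = large_case, real = real_case))
```

The check is a rough guide only, since the model matrix built from the formula can have a different number of columns than the data frame (an intercept, expanded factors, interactions). When it returns TRUE, passing sparse explicitly is the workaround described above.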
My original answer:
Both the warning and the error come from the same line in the function is.sparse in the package speedglm.
The line is:
The warning happens because of the use of nrow(X)*ncol(X) for a large matrix. The nrow and ncol functions return integer values, which can overflow. Here is an illustration:
nr = 1000000L
nc = 1000000L
nr * nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
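For comparison, here is a small sketch of the same product computed in double precision, which does not overflow. It uses only base R, and the variable names are purely illustrative.

```r
nr <- 1000000L
nc <- 1000000L

int_product  <- suppressWarnings(nr * nc)  # integer arithmetic: overflows to NA
dbl_product  <- as.numeric(nr) * nc        # double arithmetic: no overflow
prod_product <- prod(c(nr, nc))            # prod() also returns a double

print(typeof(int_product))    # "integer"
print(typeof(dbl_product))    # "double"
print(is.na(int_product))     # TRUE
print(dbl_product == prod_product)  # TRUE
```

This suggests that a size computed with as.numeric() or prod(dim(X)) would not turn into NA for a matrix of this size, although whether the downstream code could then handle such a size is a separate question.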
The error happens because the sample function fails when X is a large matrix and size = NA: the NA ends up in the condition of an if statement, which needs TRUE or FALSE. Here is an illustration:
# note: this allocates a 3e9-element numeric matrix (roughly 24 GB of memory)
sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n, :
# missing value where TRUE/FALSE needed
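The same mechanism can be reproduced without allocating a huge matrix. The sketch below is not the code inside sample; it only mimics the shape of the condition shown in the error message from the question, with n, replace, and prob set to illustrative values, to show how an NA size propagates through && into if.

```r
size    <- suppressWarnings(1000000L * 1000000L)  # NA from integer overflow
n       <- 3e9      # illustrative population size
replace <- FALSE
prob    <- NULL

# TRUE && TRUE && TRUE && NA evaluates to NA rather than TRUE or FALSE
cond <- !replace && is.null(prob) && n > 1e+07 && size <= n / 2
print(cond)

# Using NA as an if() condition raises the error seen above
msg <- tryCatch(
  if (cond) "hash-based sampling" else "regular sampling",
  error = function(e) conditionMessage(e)
)
print(msg)  # in an English locale: "missing value where TRUE/FALSE needed"
```

So the warning and the error are two symptoms of one problem: the overflow produces the NA, and the NA then breaks the if condition.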