While investigating the problem I have with make always rebuilding Makefile targets (see "make always rebuilds Makefile targets"), I uncovered another issue, which is the subject of this question. Repeated execution of the following R code results in a loss of objects' attributes during data transformation operations.
For the record, I have already written on this subject (Approaches to preserving object's attributes during extract/replace operations), but that question and its answer were more general. I was also incorrect in saying that simply saving attributes works: it worked for me at the time of that writing only because I had not yet been performing operations that are potentially dangerous for objects' attributes.
The following are excerpts from my R code, where I'm experiencing loss of attributes.
##### GENERIC TRANSFORMATION FUNCTION #####

transformResult <- function (dataSource, indicator, handler) {

  fileDigest <- base64(indicator)
  rdataFile <- paste0(CACHE_DIR, "/", dataSource, "/",
                      fileDigest, RDS_EXT)
  if (file.exists(rdataFile)) {
    data <- readRDS(rdataFile)

    # Preserve user-defined attributes for data frame's columns
    # via defining new class 'avector' (see code below). Also,
    # preserve attributes (comments) for the data frame itself.
    data2 <- data.frame(lapply(data, function(x)
      { structure(x, class = c("avector", class(x))) } ))
    #mostattributes(data2) <- attributes(data)
    attributes(data2) <- attributes(data)

    result <- do.call(handler, list(indicator, data2))
    saveRDS(result, rdataFile)
    rm(result)
  }
  else {
    # stop() signals the error; base R has no error() function
    stop("RDS file for '", indicator, "' not found! Run 'make' first.")
  }
}

## Preserve object's special attributes:
## use a class with a "as.data.frame" and "[" method

as.data.frame.avector <- as.data.frame.vector

`[.avector` <- function (x, ...) {
  #attr <- attributes(x)
  r <- NextMethod("[")
  mostattributes(r) <- attributes(x)
  #attributes(r) <- attr
  return (r)
}

##### HANDLER FUNCTION DEFINITIONS #####

projectAge <- function (indicator, data) {

  # do not process, if target column already exists
  if ("Project Age" %in% names(data)) {
    message("Project Age: ", appendLF = FALSE)
    message("Not processing - Transformation already performed!\n")
    return (invisible())
  }

  transformColumn <- as.numeric(unlist(data["Registration Time"]))
  regTime <- as.POSIXct(transformColumn, origin="1970-01-01")
  prjAge <- difftime(Sys.Date(), as.Date(regTime), units = "weeks")
  data[["Project Age"]] <- as.numeric(round(prjAge)) / 4 # in months

  # now we can delete the source column
  if ("Registration Time" %in% names(data))
    data <- data[setdiff(names(data), "Registration Time")]

  if (DEBUG2) {print(summary(data)); print("")}

  return (data)
}

projectLicense <- function (indicator, data) {

  # do not process, if target column (type) already exists
  if (is.factor(data[["Project License"]])) {
    message("Project License: ", appendLF = FALSE)
    message("Not processing - Transformation already performed!\n")
    return (invisible())
  }

  data[["Project License"]] <-
    factor(data[["Project License"]],
           levels = c('gpl', 'lgpl', 'bsd', 'other',
                      'artistic', 'public', '(Other)'),
           labels = c('GPL', 'LGPL', 'BSD', 'Other',
                      'Artistic', 'Public', 'Unknown'))

  if (DEBUG2) {print(summary(data)); print("")}

  return (data)
}

devTeamSize <- function (indicator, data) {

  var <- data[["Development Team Size"]]

  # convert data type from 'character' to 'numeric'
  if (!is.numeric(var)) {
    data[["Development Team Size"]] <- as.numeric(var)
  }

  if (DEBUG2) {print(summary(data)); print("")}

  return (data)
}

##### MAIN #####

# construct list of indicators & corresponding transform. functions
indicators <- c("prjAge", "prjLicense", "devTeamSize")
transforms <- list(projectAge, projectLicense, devTeamSize)

# sequentially call all previously defined transformation functions
lapply(seq_along(indicators),
       function(i) {
         transformResult("SourceForge",
                         indicators[[i]], transforms[[i]])
       })

After the second run of this code, the names "Project Age" and "Project License", as well as other user-defined attributes of the data frame data2, are lost.
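For reference, here is a minimal sketch of how the avector approach is meant to behave on a single vector, outside of the caching code. The vectors plain and v and the label attribute are made up for illustration and are not part of the actual data.

```r
# Same method as in the excerpt: re-apply the original attributes after subsetting
`[.avector` <- function(x, ...) {
  r <- NextMethod("[")
  mostattributes(r) <- attributes(x)
  r
}

# Hypothetical vector with a user-defined 'label' attribute
plain <- structure(1:5, label = "Team size")
v <- structure(plain, class = c("avector", class(plain)))

attr(plain[2:3], "label")  # NULL: default subsetting drops the attribute
attr(v[2:3], "label")      # "Team size": kept by the avector method
```

If this behaves as expected on a bare vector, the loss in the full code presumably happens somewhere other than in the `[` method itself.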
My question here is multifaceted:
1) Which statements in my code could lead to the loss of attributes, AND WHY?
2) Which is the correct line of code (mostattributes <- attributes or attributes <- attributes/attr) in transformResult() and in the avector class definition, AND WHY?
3) Is the statement as.data.frame.avector <- as.data.frame.vector really needed if I add the class attribute avector to a data frame object and, in general, prefer a generic solution (applicable not only to data frames)? WHY OR WHY NOT?
4) Saving via attr in the class definition doesn't work; it fails with the following error:
Error in attributes(r) <- attr :
'names' attribute [5] must be the same length as the vector [3]
Calls: lapply ... summary.data.frame -> lapply -> FUN -> summary.default -> [ -> [.avector
So, I had to go back to using mostattributes(). Is it OK?
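The difference between the two assignment forms in item 4 can be reproduced on a small named vector. This is only a sketch: the vector x and its units attribute are made up for illustration. According to ?attributes, the mostattributes assignment sets names, dim and dimnames only when they are known to be valid, whereas an attributes assignment gives an error if any of them are not.

```r
# Hypothetical named vector with a user-defined 'units' attribute
x <- structure(c(a = 1, b = 2, c = 3, d = 4, e = 5), units = "weeks")
r <- x[1:3]  # default subsetting keeps the matching names but drops 'units'

# attributes<- tries to put all five names on a length-3 vector and fails
msg <- tryCatch({
  attributes(r) <- attributes(x)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# mostattributes<- skips the invalid 'names' and still restores 'units'
r2 <- r
mostattributes(r2) <- attributes(x)
attr(r2, "units")  # "weeks"
```

Note that the five original names cannot be carried over to the shortened r2, so any names that matter should be checked after such an assignment.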
==========
I have read the following on the subject:
SO question: How to delete a row from a data.frame without losing the attributes (I like the solution by Ben Barns, but it differs a bit from the one suggested by Gabor Grothendieck and Marc Schwartz - see below);
SO question: indexing operation removes attributes (while the solution is legible, I prefer one, based on class definition /sub-classing?/);
A generic solution suggested by Heinz Tuechler (https://stat.ethz.ch/pipermail/r-help/2006-July/109148.html) - Do I need this?;
An explanation by Brian Ripley (http://r.789695.n4.nabble.com/Losing-attributes-in-data-frame-PR-10873-tp919265p919266.html) - I found it somewhat confusing;
A solution suggested by Gabor Grothendieck (https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html);
An explanation of Gabor Grothendieck's solution by Marc Schwartz (https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html) - very nice explanation;
Sections 8.1.28 and 8.1.29 of the "R Inferno" book (www.burns-stat.com/pages/Tutor/R_inferno.pdf) - I've tried his suggestion of using storage.mode(), but it doesn't really solve the problem, as coercing via storage doesn't affect the class of an object (not to mention that it doesn't cover attribute-clearing operations other than coercion, such as subsetting and indexing);
http://stat.ethz.ch/R-manual/R-devel/library/base/html/attributes.html;
http://cran.r-project.org/doc/manuals/r-devel/R-lang.html#Copying-of-attributes.
P.S. I believe that this question is of general nature, so I haven't provided a reproducible example at this time. I hope that it's possible to answer this without such example, but, if not, please let me know.
I'm answering my own question - well, for now, only partially:
1) Under more intense investigation and after some code updates, it appears that attributes in fact are NOT being lost (still trying to figure out what changes caused the expected behavior - will report later).
2) I have figured out the reason for the intermittent output and for the loss of all cache data after the transformation. During multiple subsequent runs of the code, the second run of each transformation (handler) function (projectAge(), projectLicense() and devTeamSize()) returns NULL, since the transformation has already been done:
if (<condition>) {
  ...
  message("Not processing - Transformation already performed!\n")
  return (invisible()) # <= returns NULL
}
The returned NULL was then passed to saveRDS(), thus causing the loss of cache data.
I fixed this problem by simply validating result before saving the transformed object:
# the next line is problematic due to wrong assumption of always having full data returned
result <- do.call(handler, list(indicator, data2))
if (!is.null(result)) saveRDS(result, rdataFile) # <= fixed by validating incoming data
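Both the failure mode and the guard can be reproduced in isolation. The sketch below uses a temporary file in place of the real cache and a made-up handler; cacheFile, handler and the column x are hypothetical names, not part of my actual code.

```r
# Hypothetical stand-in for the cache: a temporary RDS file with a small data frame
cacheFile <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = 1:3), cacheFile)

# Hypothetical handler that bails out when its work has already been done
handler <- function(data) {
  if ("x" %in% names(data)) return(invisible())  # returns NULL
  data
}

result <- handler(readRDS(cacheFile))
is.null(result)  # TRUE

# Guarded save: the cache is left alone when the handler returns NULL
if (!is.null(result)) saveRDS(result, cacheFile)
afterGuarded <- readRDS(cacheFile)
is.data.frame(afterGuarded)  # TRUE

# Unguarded save (the original behavior): the cache now holds NULL
saveRDS(result, cacheFile)
afterUnguarded <- readRDS(cacheFile)
is.null(afterUnguarded)  # TRUE
```

An alternative would be to have each handler return the unmodified data instead of NULL when there is nothing to do, which would keep the caller free of special cases.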
That's it so far, thanks for reading! I will be updating this answer until all the issues are clarified.