In the interest of replication, I like to keep a codebook with metadata for each data frame. A data codebook is:
"a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database." Marczyk et al. (2010)
I like to document the following attributes of a variable:
- name
- description (label, format, scale, etc.)
- source (e.g. World Bank)
- source media (URL and date accessed, CD and ISBN, or whatever)
- file name of the source data on disk (helps when merging codebooks)
- notes
For example, this is what I use to document the variables in data frame mydata1, which has 8 variables:
# one codebook row per variable; ncol() gives the number of variables
n.vars <- ncol(mydata1)
code.book.mydata1 <- data.frame(
  variable.name = names(mydata1),
  label = c("Label 1",
            "State name",
            "Personal identifier",
            "Income per capita, thousands of US$, constant year 2000 prices",
            "Unique id",
            "Calendar year",
            "blah",
            "bah"),
  source       = rep("unknown", n.vars),
  source_media = rep("unknown", n.vars),
  filename     = rep("unknown", n.vars),
  notes        = rep("unknown", n.vars)
)
I write a separate codebook for each data set I read. When I merge data frames, I also merge the relevant parts of their codebooks to document the final database. I do this by essentially copy-pasting the code above and changing the arguments.
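To avoid the copy-pasting, the skeleton could be wrapped in a small function. This is only a sketch: make_codebook is a name I made up, the second data frame is invented for illustration, and "keep the first entry per variable" is just one possible rule for merging codebooks.

```r
# Hypothetical helper: builds a codebook skeleton with one row per variable.
# Scalar arguments such as source = "unknown" are recycled over all variables.
make_codebook <- function(df, label, source = "unknown", source_media = "unknown",
                          filename = "unknown", notes = "") {
  stopifnot(is.data.frame(df), length(label) == ncol(df))
  data.frame(variable.name = names(df), label = label,
             source = source, source_media = source_media,
             filename = filename, notes = notes,
             stringsAsFactors = FALSE)
}

# Codebook for the built-in cars data set
cb_cars <- make_codebook(
  cars,
  label  = c("Speed (mph)", "Stopping distance (ft)"),
  source = "Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley."
)

# A second, made-up data frame standing in for another source file
other <- data.frame(speed = c(4, 7), driver = c("a", "b"))
cb_other <- make_codebook(other, label = c("Speed (mph)", "Driver id"))

# Codebook for the merged data: stack both, keep the first entry per variable
cb_all <- rbind(cb_cars, cb_other)
cb_all <- cb_all[!duplicated(cb_all$variable.name), ]
```

Here cb_all ends up with one row each for speed, dist and driver, and the speed entry keeps the source recorded in cb_cars.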
You can add any special attribute to any R object with the attr function. E.g.:
x <- cars
attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
The attribute then shows up in the structure of the object:
> str(x)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
- attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
You can also retrieve the attribute with the same attr function:
> attr(x, "source")
[1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
If you only add new cases (rows) to your data frame, the attribute is not affected (see: str(rbind(x,x))), while altering the structure erases it (see: str(cbind(x,x))).
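If you need the attribute to survive an operation like cbind, one option is to copy the non-standard attributes back afterwards. A minimal sketch; copy_custom_attrs is a helper name I made up, not a base R function:

```r
# Hypothetical helper: copies every non-standard attribute from one object to another.
copy_custom_attrs <- function(from, to,
                              standard = c("names", "row.names", "class")) {
  for (nm in setdiff(names(attributes(from)), standard)) {
    attr(to, nm) <- attr(from, nm, exact = TRUE)
  }
  to
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."

wide <- cbind(x, x)                  # the custom attribute is dropped here
is.null(attr(wide, "source"))        # check that it is gone
wide <- copy_custom_attrs(x, wide)   # put it back
attr(wide, "source")
```

Only attributes of the data frame itself are handled here; attributes attached to individual columns are a separate matter.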
UPDATE: based on comments
If you want to list all non-standard attributes, check the following:
setdiff(names(attributes(x)),c("names","row.names","class"))
(The standard attributes of a data frame are names, row.names and class.)
Based on that, you could write a short function to list all non-standard attributes and also the values. The following does work, though not in a neat way... You could improve it and make up a function :)
First, define the unique (= non-standard) attributes:
uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))
And make a matrix which will hold the names and values:
attribs <- matrix(0,0,2)
Loop through the non-standard attributes and save in the matrix the names and values:
# seq_along() is safe when there are no non-standard attributes (1:0 would loop twice)
for (i in seq_along(uniqueattrs)) {
  attribs <- rbind(attribs, c(uniqueattrs[i], attr(x, uniqueattrs[i])))
}
Convert the matrix to a data frame and name the columns:
attribs <- as.data.frame(attribs)
names(attribs) <- c('name', 'value')
And save in any format, e.g.:
write.csv(attribs, 'foo.csv')
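A somewhat neater version of the steps above could look like the following. It is just a sketch: list_custom_attrs is a name I made up, and collapsing multi-element attributes into a single string is my own choice so that each attribute fits in one row.

```r
# Hypothetical helper: returns the non-standard attributes of an object
# as a two-column data frame (name, value).
list_custom_attrs <- function(x, standard = c("names", "row.names", "class")) {
  custom <- setdiff(names(attributes(x)), standard)
  values <- vapply(custom, function(nm) {
    v <- attr(x, nm, exact = TRUE)
    if (is.atomic(v)) {
      paste(as.character(v), collapse = "; ")  # vectors become one string
    } else {
      paste(deparse(v), collapse = " ")        # other objects: text representation
    }
  }, character(1))
  data.frame(name = custom, value = unname(values), stringsAsFactors = FALSE)
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
attribs <- list_custom_attrs(x)
attribs
```

The result can be passed to write.csv as before, and an object without any non-standard attributes simply gives a data frame with zero rows.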
To your question about the variable labels, check the read.spss function from package foreign, as it does exactly what you need: it saves the variable labels in an attribute of the imported data frame. The main idea is that an attribute can be a vector, a data frame or another object, so you do not need to create a separate attribute for every variable; create only one (e.g. named "variable.labels") and save all the information there. You can then call attr(x, "variable.labels")['foo'], where 'foo' stands for the required variable name. But check the function cited above and the attributes of the imported data frames for more details.
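To illustrate the single-attribute idea without an SPSS file, here is a small sketch on the built-in cars data. The attribute name "variable.labels" follows the convention mentioned above; the labels are taken from the cars help page.

```r
x <- cars

# One named character vector holds the labels of all variables
attr(x, "variable.labels") <- c(speed = "Speed (mph)",
                                dist  = "Stopping distance (ft)")

# Look up the label of a single variable by name
attr(x, "variable.labels")["dist"]

# Turn the labels into a codebook-style data frame, in column order
labs <- attr(x, "variable.labels")
codebook <- data.frame(variable.name = names(x),
                       label = unname(labs[names(x)]),
                       stringsAsFactors = FALSE)
```

Indexing labs by names(x) keeps the labels aligned with the columns even if the vector was written in a different order.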
I hope these could help you to write the required functions in a lot neater way than I tried above! :)