
How to create, structure, maintain and update data codebooks in R?


In the interest of replication, I like to keep a codebook with metadata for each data frame. A data codebook is:

a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database. Marczyk et al (2010)

I like to document the following attributes of a variable:

  • name
  • description (label, format, scale, etc)
  • source (e.g. World Bank)
  • source media (url and date accessed, CD and ISBN, or whatever)
  • file name of the source data on disk (helps when merging codebooks)
  • notes

For example, this is what I am implementing to document the variables in data frame mydata1 with 8 variables:

code.book.mydata1 <- data.frame(
  variable.name = names(mydata1),
  label = c("Label 1",
            "State name",
            "Personal identifier",
            "Income per capita, thousands of US$, constant year 2000 prices",
            "Unique id",
            "Calendar year",
            "blah",
            "bah"),
  # ncol() makes it explicit that there is one entry per variable
  source       = rep("unknown", ncol(mydata1)),
  source_media = rep("unknown", ncol(mydata1)),
  filename     = rep("unknown", ncol(mydata1)),
  notes        = rep("unknown", ncol(mydata1)),
  stringsAsFactors = FALSE  # keep the text columns as character
)

I write a separate codebook for each data set I read. When I merge data frames, I also merge the relevant parts of their associated codebooks to document the final database. I do this by essentially copying and pasting the code above and changing the arguments.
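The copy-and-paste step could probably be replaced by a small helper. The following is only a sketch under stated assumptions: `make_codebook()` is a hypothetical name, the built-in `cars` and `women` data sets stand in for the real data frames, and the file names are made up.

```r
# Hypothetical helper: build a codebook skeleton for any data frame.
# Scalar arguments are recycled to one row per variable.
make_codebook <- function(df, label = NA_character_, source = "unknown",
                          source_media = "unknown", filename = "unknown",
                          notes = "unknown") {
  data.frame(
    variable.name = names(df),
    label         = label,
    source        = source,
    source_media  = source_media,
    filename      = filename,
    notes         = notes,
    stringsAsFactors = FALSE
  )
}

# Built-in data sets used as stand-ins; the file names are invented.
cb1 <- make_codebook(cars,
                     label = c("Speed (mph)", "Stopping distance (ft)"),
                     filename = "cars.csv")
cb2 <- make_codebook(women, filename = "women.csv")

# When the data sets are merged, stack their codebooks as well.
code.book.merged <- rbind(cb1, cb2)
```

If the merged data sets share key variables, the stacked codebook will contain them twice, so some de-duplication on `variable.name` would still be needed.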


Solution

  • You can attach arbitrary attributes to any R object with the attr function, e.g.:

    x <- cars
    attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."
    

    The attribute then shows up in the structure of the object:

    > str(x)
    'data.frame':   50 obs. of  2 variables:
     $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
     $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
     - attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."
    

    You can also retrieve the attribute with the same attr function:

    > attr(x, "source")
    [1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."
    

    If you only add new cases to your data frame, the attribute is not affected (see str(rbind(x,x))), while altering the structure erases it (see str(cbind(x,x))).
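    The difference between the two cases can be checked directly. This is a small sketch using the same `cars` example; the last line shows one possible workaround (copying the attribute across by hand), which is an assumption about what you might want rather than a built-in mechanism.

    ```r
    x <- cars
    attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

    more_rows <- rbind(x, x)   # same columns, more cases
    more_cols <- cbind(x, x)   # changed structure

    attr(more_rows, "source")  # still set
    attr(more_cols, "source")  # NULL: the attribute is gone

    # Possible workaround: copy the attribute to the new object by hand.
    attr(more_cols, "source") <- attr(x, "source")
    ```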


    UPDATE: based on comments

    If you want to list all non-standard attributes, check the following:

    setdiff(names(attributes(x)),c("names","row.names","class"))
    

    For a data frame, the standard attributes are names, row.names and class; everything else is returned.

    Based on that, you could write a short function that lists all non-standard attributes together with their values. The following works, though not in a neat way... You could improve it and wrap it in a function :)

    First, define the unique (= non-standard) attributes:

    uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))
    

    And make a matrix which will hold the names and values:

    attribs <- matrix(0,0,2)
    

    Loop through the non-standard attributes and save in the matrix the names and values:

    # seq_along() avoids the 1:0 trap when there are no non-standard attributes
    for (i in seq_along(uniqueattrs)) {
        attribs <- rbind(attribs, c(uniqueattrs[i], attr(x, uniqueattrs[i])))
    }
    

    Convert the matrix to a data frame and name the columns:

    attribs <- as.data.frame(attribs)
    names(attribs) <- c('name', 'value')
    

    Then save it in any format, e.g.:

    write.csv(attribs, 'foo.csv')
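
    The steps above could be wrapped in a function along these lines. It is only a sketch: `custom_attrs()` is a hypothetical name, the "accessed" attribute and its value are made up for illustration, and attribute values are assumed to be atomic vectors (a data frame stored as an attribute would need different handling).

    ```r
    # Hypothetical helper: collect all non-standard attributes of an object
    # into a two-column data frame (name, value).
    custom_attrs <- function(obj, standard = c("names", "row.names", "class")) {
      nms <- setdiff(names(attributes(obj)), standard)
      # Collapse each value to one string so that vectors fit in one cell.
      vals <- vapply(nms, function(n) {
        paste(as.character(attr(obj, n, exact = TRUE)), collapse = "; ")
      }, character(1))
      data.frame(name = nms, value = unname(vals), stringsAsFactors = FALSE)
    }

    x <- cars
    attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."
    attr(x, "accessed") <- "2011-05-01"   # made-up value for illustration

    attribs <- custom_attrs(x)
    # Written to a temporary directory here; use any path you like.
    write.csv(attribs, file.path(tempdir(), "foo.csv"), row.names = FALSE)
    ```

    Unlike the loop, this also returns an empty data frame (rather than failing) when the object has no non-standard attributes.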
    

    To your question about the variable labels, check the read.spss function from package foreign, as it does exactly what you need: it saves the variable labels in an attribute. The main idea is that an attribute can be a data frame or any other object, so you do not need a separate attribute for every variable; make only one (e.g. named "variable.labels") and store all the information there. You can then call attr(x, "variable.labels")['foo'], where 'foo' stands for the required variable name. Check the function cited above and the attributes of the imported data frames for more details.
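
    The single-attribute idea can be tried without any SPSS file. In this sketch the attribute is set by hand on the `cars` data (the labels follow the descriptions in its help page), and `var_labels` and `codebook` are illustrative names only.

    ```r
    x <- cars
    # One named character vector holds the labels for all variables.
    attr(x, "variable.labels") <- c(speed = "Speed (mph)",
                                    dist  = "Stopping distance (ft)")

    # Label of a single variable, looked up by name.
    attr(x, "variable.labels")["dist"]

    # Turn the attribute into a codebook-style data frame.
    var_labels <- attr(x, "variable.labels")
    codebook <- data.frame(variable.name = names(var_labels),
                           label = unname(var_labels),
                           stringsAsFactors = FALSE)
    ```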

    I hope this helps you write the required functions in a much neater way than I did above! :)