Search code examples
rdataframedatatabledata-wranglingdat-protocol

How to properly load .dat file in r without variable labels (NBER CPS data)


I am trying to load into r and wrangle data from the CPS (Current Population Survey) which can be downloaded at this link. There is an ostensible codebook for the information on variables and the ranges of each variable as coded in files such as those at this link and this link (Both examples are for the 2004 dataset). However, when I try to load the data file into r, I encounter issues where the variables seem to match up but there are gaps. If I open the 2004 file with the following code:

library(data.table)
cps_raw <- read.delim("cpsmay04.dat", header = FALSE, sep = "")

This gives me the following output:

image1

which differs from when I change the separator to be a full space:

library(data.table)
cps_raw <- read.delim("cpsmay04.dat", header = FALSE, sep = " ")

giving the following output: image2

I am inclined to think the first one is correct, though I'm not sure it gets all the columns correct, since the spaces between values are different. The codebook does not seem to indicate how the data should be loaded in, and I am only used to working with .csv, .dta, .xlsx or .sav files. If the first option is correct, how do I know which variable is which? Do I simply index by the codebook? If so, is there a more efficient way to rename the variables throughout the dataset when loading it into r? Am I loading it incorrectly such that the variable names are not loaded, or must this be done manually? If it must be done manually, is there a way to only load part of the dataframe (certain columns) to not have to deal with many variables that are not necessary, since there are so many?

UPDATE to troubleshoot Rui's solution

While the solution mostly appears to work, the 04 file seems not to take.

For example, when using the 1997 file, the HURESPLI variable has no NA values and no -1 values, which are not detailed in the variable codebook (values are between 0-99). The variable distribution is the following:

image3

However, when trying to load the same variable for the 2004 file now there are many -1's which do not match the variable codebook as a defined value. This variable should be available for all individual observations, and should not be repeated for values with the same household ID (HRHHID) since it represents a unique respondent within a household:

image4

It also produces a new variable not specified in the .ddf file, HURESPLI_61 which seems to be the same variable plus the column number after the HURESPLI column, which has a different set of values:

image5

This no longer has all the -1 values, but I would still be skeptical of it being the true values of HURESPLI, because the numbers above 0 should represent the number of the respondent within household in the case where there is more than 1 respondent per household (so there should be mostly 1, since most respondents would be in a multiperson household, but information on other household members would not be as often provided. But there being only 103 zeros in the 2004 file after 9000 in the 97 file leaves me skeptical.

Is there any reason why the code would be creating additional columns, and can I be sure that the variables are accurate, as HURESPLI does not look right?


Solution

  • The following code was tested with the data in the question link, no need for SPSS , SAS or STATA files.
    The main problem is to parse the .ddf file, which is not that difficult considering that the data width lines always begin with the character "D". Filter those lines, split by spaces and coerce the two last columns to integer. Then use column "SIZE" as column widths.

    process_ddf <- function(file) {
      ddf <- readLines(file)
      i <- grep("^D ", ddf)
      mat <- ddf[i] |>
        strsplit(" +") |>
        sapply(`[`, -1L) |>
        t()
      out <- mat |>
        as.data.frame() |>
        setNames(c("DATA", "SIZE", "BEGIN"))
      out[-1L] <- lapply(out[-1L], as.integer)
      if(anyNA(out[-1L])) {
        mat[, -1L] <- apply(mat[, -1L], 2L, \(x) gsub("\\D", "", x))
        out[, -1L] <- apply(mat[, -1L], 2L, as.integer)
      }
      out
    }
    
    data_file <- "cpsmay04.dat"
    ddf_link <- "http://data.nber.org/cps/cpsmay04.ddf"
    d <- process_ddf(ddf_link)
    # some columns are fillers, need to make unique col names
    filler <- grep("FILLER", d$DATA)
    padding <- grep("PADDING", d$DATA)
    d$DATA[filler] <- paste(d$DATA[filler], filler, sep = "_")
    d$DATA[padding] <- paste(d$DATA[padding], padding, sep = "_")
    
    # now read in the data
    # read.fwf is painfully slow, use readr::read_fwf
    library(readr)
    cpsmay04 <- read_fwf(data_file, fwf_widths(d$SIZE, d$DATA))
    
    # optional
    spc <- spec(cpsmay04)
    spc