Search code examples
rspss

Getting an SPSS data file into R


At my company, we are thinking of gradually phasing out SPSS in choice of R. During the transition though we'll still be having the data coming in SPSS data file format (.sav).

I'm having issues importing this SPSS datafile into R. When I import an SPSS file into R, I want to retain both the values and value labels for the variables. The read.spss() function from foreign package gives me option to retain either values OR value labels of a variable but not both.

AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss() function to incorporate this.

Alternatively, I came across spss.system.file() function from memisc package which supposedly allows this to happen, but it asks for a separate syntax file (codes.file), which is not necessarily available to me always.

Here's a sample data file.

I'd appreciate any help resolving this issue.

Thanks.


Solution

  • There is a solution to read SPSS data file in R by ODBC driver.

    1) There is a IBM SPSS Statistics Data File Driver. I could not find the download link. I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.

    2) Create a DSN for the SPSS data driver.

    3) Using RODBC package you can read in R any SPSS data file. It will be possible to get value labels for each variable as separate tables. Then it is possible to use the labels in R in any way as you wish.

    Here is a working example on Windows (I do not have SPSS on my computer now) to read in R your example data file. I have not testted this on Linux. It probably works also on Linux, because there is a SPSS data driver also for Linux.

    require(RODBC)
    
    # Create connection
    # Change the DSN name and CP_CONNECT_STRING according to your setting
    con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")
    
    # List of tables
    Tables <- sqlTables(con)
    Tables
    
    # List of table names to extract
    table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]
    
    # Function to query a table by name
    sqlQuery.tab.name <- function(table) {
      sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
    }
    
    # Retrieve all tables
    Data <- lapply(table.names, sqlQuery.tab.name)
    
    # See the data
    lapply(Data, head)
    
    # Close connection
    close(con)
    

    For example we can that value labels are defined for two variables:

    [[5]]
      VAR00002 VAR00002_label
    1        1           Male
    2        2         Female
    
    [[6]]
      VAR00003 VAR00003_label
    1        2        Student
    2        3       Employed
    3        4     Unemployed
    

    Additional information

    Here is a function that allows to read SPSS data after the connection has been made to the SPSS data file. The function allows to specify the list of variables to be selected. If value.labels=T the selected variables with value labels in SPSS data file are converted to the R factors with labels attached.

    I have to say I am not satisfied with the performance of this solution. It work good for small data files. The RAM limit is reached quite often for large SPSS data files (even the subset of variables is selected).

    get.spss <- function(channel, variables = NULL, value.labels = F) {
    
      VarNames <- sqlQuery(channel = channel,
        query = "SELECT VarName FROM [Variables]", as.is = T)$VarName
    
      if (is.null(variables)) variables <- VarNames else {
        if (any(!variables %in% VarNames)) stop("Wrong variable names")
      }
    
      if (value.labels) {
        ValueLabelTableName <- sqlQuery(channel = channel,
          query = "SELECT VarName FROM [Variables]
                                        WHERE ValueLabelTableName is not null",
          as.is = T)$VarName
        ValueLabelTableName <- intersect(variables, ValueLabelTableName)
      }
    
      variables <- paste(variables, collapse = ", ")
    
      data <- sqlQuery(channel = channel,
        query = paste("SELECT", variables, "FROM [Cases]"),
        as.is = T)
    
      if (value.labels) {
        for (var in ValueLabelTableName) {
          VL <- sqlQuery(channel = channel,
            query = paste0("SELECT * FROM [VLVAR", var,"]"),
            as.is = T)
          data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
        }
      }
    
      return(data)
    }