Search code examples
rdataframecsvtidyversespss

Convert a small dataset written in SPSS to CSV


I have a small dataset written in SPSS syntax which comes from Table 5.3 p. 189 of this book (type 210 in the page slot to see the table).

I was wondering if there might be a way to convert this data to .csv file? (I want to use the data in R afterwards)

# SPSS Code:

DATA LIST FREE/gpid anx socskls assert.
BEGIN DATA.
1 5 3 3     1 5 4 3     1 4 5 4     1 4 5 4
1 3 5 5     1 4 5 4     1 4 5 5     1 4 4 4
1 5 4 3     1 5 4 3     1 4 4 4
2 6 2 1     2 6 2 2     2 5 2 3     2 6 2 2
2 4 4 4     2 7 1 1     2 5 4 3     2 5 2 3
2 5 3 3     2 5 4 3     2 6 2 3
3 4 4 4     3 4 3 3     3 4 4 4     3 4 5 5
3 4 5 5     3 4 4 4     3 4 5 4     3 4 6 5
3 4 4 4     3 5 3 3     3 4 4 4
END DATA.

EDIT - in order to check answers I am adding here the actual way the data looks after reading it in SPSS :

gpid   anx  socskls   assert
1       5       3       3
1       5       4       3
1       4       5       4
1       4       5       4
1       3       5       5
1       4       5       4
1       4       5       5
1       4       4       4
1       5       4       3
1       5       4       3
1       4       4       4
2       6       2       1
2       6       2       2
2       5       2       3
2       6       2       2
2       4       4       4
2       7       1       1
2       5       4       3
2       5       2       3
2       5       3       3
2       5       4       3
2       6       2       3
3       4       4       4
3       4       3       3
3       4       4       4
3       4       5       5
3       4       5       5
3       4       4       4
3       4       5       4
3       4       6       5
3       4       4       4
3       5       3       3
3       4       4       4

Solution

  • If I understand correctly, the 1st, 5th, 9th, and 13th column of the dataset belong to variable gpid, the 2nd, 6th, 10th, and 14th column belong to variable anx, and so on. So, we need to

    • reshape from wide to long format
    • with multiple measure variables
    • where each measure variable spans several columns
    • and where some values are missing.

    Many roads lead to Rome.

    This is what I would do using my favourite tools. In particular, this approach uses the feature of data.table::melt() to reshape multiple measure columns simultaneously. There is no manual cleanup of the data section in a text editor required.

    The resulting dataset result can be used directly afterwards in any subsequent R code as requested by the OP. There is no need to take a detour using a .csv file (However, feel free to save result as a .csv file).

    library(data.table)
    library(magrittr)
    cols <- c("gpid", "anx", "socskls", "assert")
    raw <- fread(text = "
    1 5 3 3     1 5 4 3     1 4 5 4     1 4 5 4
    1 3 5 5     1 4 5 4     1 4 5 5     1 4 4 4
    1 5 4 3     1 5 4 3     1 4 4 4
    2 6 2 1     2 6 2 2     2 5 2 3     2 6 2 2
    2 4 4 4     2 7 1 1     2 5 4 3     2 5 2 3
    2 5 3 3     2 5 4 3     2 6 2 3
    3 4 4 4     3 4 3 3     3 4 4 4     3 4 5 5
    3 4 5 5     3 4 4 4     3 4 5 4     3 4 6 5
    3 4 4 4     3 5 3 3     3 4 4 4",
    fill = TRUE)
    mv <- colnames(raw) %>% 
      matrix(ncol = 4L, byrow = TRUE) %>% 
      as.data.table() %>% 
      setnames(new = cols)
    result <- melt(raw, measure.vars = mv, na.rm = TRUE)[
      order(rowid(variable))][
        , variable := NULL]
    result
    
        gpid anx socskls assert
     1:    1   5       3      3
     2:    1   5       4      3
     3:    1   4       5      4
     4:    1   4       5      4
     5:    1   3       5      5
     6:    1   4       5      4
     7:    1   4       5      5
     8:    1   4       4      4
     9:    1   5       4      3
    10:    1   5       4      3
    11:    1   4       4      4
    12:    2   6       2      1
    13:    2   6       2      2
    14:    2   5       2      3
    15:    2   6       2      2
    16:    2   4       4      4
    17:    2   7       1      1
    18:    2   5       4      3
    19:    2   5       2      3
    20:    2   5       3      3
    21:    2   5       4      3
    22:    2   6       2      3
    23:    3   4       4      4
    24:    3   4       3      3
    25:    3   4       4      4
    26:    3   4       5      5
    27:    3   4       5      5
    28:    3   4       4      4
    29:    3   4       5      4
    30:    3   4       6      5
    31:    3   4       4      4
    32:    3   5       3      3
    33:    3   4       4      4
        gpid anx socskls assert
    

    Some explanations

    fread() returns a data.table raw with default column names V1, V2, ... V16 and with missing values filled with NA

    mv is a data.table which indicates which columns of raw belong to each target variable:

    mv
    
       gpid anx socskls assert
    1:   V1  V2      V3     V4
    2:   V5  V6      V7     V8
    3:   V9 V10     V11    V12
    4:  V13 V14     V15    V16
    

    This informations is used by melt(). melt() also removes rows with missing values from the resulting long format.

    After reshaping, the rows are ordered by the variable number but need to be reordered in the original row order by using rowid(variable). Finally, the variable column is removed.

    EDIT: Improved version

    Giving a second thought, here is a streamlined version of the code which skips the creation of mv and uses data.table chaining:

    library(data.table)
    cols <- c("gpid", "anx", "socskls", "assert")
    result <- fread(
      text = "
    1 5 3 3     1 5 4 3     1 4 5 4     1 4 5 4
    1 3 5 5     1 4 5 4     1 4 5 5     1 4 4 4
    1 5 4 3     1 5 4 3     1 4 4 4
    2 6 2 1     2 6 2 2     2 5 2 3     2 6 2 2
    2 4 4 4     2 7 1 1     2 5 4 3     2 5 2 3
    2 5 3 3     2 5 4 3     2 6 2 3
    3 4 4 4     3 4 3 3     3 4 4 4     3 4 5 5
    3 4 5 5     3 4 4 4     3 4 5 4     3 4 6 5
    3 4 4 4     3 5 3 3     3 4 4 4",
      fill = TRUE, col.names = rep(cols, 4L))[
        , melt(.SD, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)][
          order(rowid(variable))][
            , variable := NULL][]
    result
    

    Here, the columns are renamed within the call to fread(). In this case, duplicated column names are desirable (as opposed to the usual use case) because the patterns() function in the subsequent call to melt() use the duplicated column names to combine the columns which belong to one measure variable.