Search code examples
rloopssashierarchical-data

loop through columns and generate rows for each record R


When I used to program in SAS, it was easy to loop across variables and generate rows after applying some sort of logic. I would like to do this in R. To illustrate, consider the following scenario.

Input
+----+----------------+-------------------+-------------+
| ID |      lvl1      |        lvl2       |     lvl3    |
+----+----------------+-------------------+-------------+
|  1 |  United States |  Ohio             |  Cincinnati |
|  2 |  Ohio          |  Cincinnati       |             |
|  3 |  Canada        |  British Columbia |  Vancouver  |
+----+----------------+-------------------+-------------+

Let array _lvl contain lvl1, lvl2, and lvl3 and loop from i = 1 to 3 (the upper bound).

  1. Loop 1: _lvl(1) Check whether lvl1 is null and output if not, supplying null values for the other values.
  2. Loop 2: _lvl(2) Check whether lvl1 is null or lvl2 is null and output if not, supplying null for lvl3.
  3. Loop 3: _lvl(3) Check whether lvl1 is null, lvl2 is null, or lvl3 is null and output if not.

This should produce the following for ID = 2.

Output
+----+-------+-------------+-------+
| ID |  lvl1 |     lvl2    |  lvl3 |
+----+-------+-------------+-------+
|  2 |  Ohio |             |       |
|  2 |  Ohio |  Cincinnati |       |
+----+-------+-------------+-------+

Thank you in advance.


Solution

  • I'm not too keen on your particular use case, but I'm sympathetic to the general struggle of adjusting to R after using the SAS data step in this way.

    The way I emulate output statements in data step in R, is by applying functions along the rows (or in this case, columns) of a data frame.

    Define your "array" of column names:

    lvl <- c('lvl1','lvl2','lvl3')
    

    Then mimic the SAS output statements using a function (with i playing the same role as the index of your SAS array, data representing the input data frame, and names being the entire vector of columns names to iterate over):

    outputcriteria <- function(i, data, names) {
      data <- data[!is.na(data[,names[i]]),] #exclude rows with NA in the current column
      data[,names[-(1:i)]] <- NA             #blank out columns after the current column
      return (data)
    }
    

    Then apply this function to your data frame df with lapply() by setting the input vector to 1:length(lvl) (the index of your vector of column names), and also passing the data and name arguments into the function:

    l <- lapply(1:length(lvl), outputcriteria, data = df, names = lvl)
    

    The result l is a list of three data frames (one for each iteration), which you can rbind into a single data frame:

    do.call(rbind, l)
    

    The main difference is that you end up reading the data three times and outputting a data frame each time, as opposed to reading it once in SAS.