When I used to program in SAS, it was easy to loop across variables and generate rows after applying some sort of logic. I would like to do this in R. To illustrate, consider the following scenario.
Input
+----+----------------+-------------------+-------------+
| ID | lvl1 | lvl2 | lvl3 |
+----+----------------+-------------------+-------------+
| 1 | United States | Ohio | Cincinnati |
| 2 | Ohio | Cincinnati | |
| 3 | Canada | British Columbia | Vancouver |
+----+----------------+-------------------+-------------+
Let array _lvl contain lvl1, lvl2, and lvl3 and loop from i = 1 to 3 (the upper bound).
This should produce the following for ID = 2.
Output
+----+-------+-------------+-------+
| ID | lvl1 | lvl2 | lvl3 |
+----+-------+-------------+-------+
| 2 | Ohio | | |
| 2 | Ohio | Cincinnati | |
+----+-------+-------------+-------+
Thank you in advance.
I'm not too keen on your particular use case, but I'm sympathetic to the general struggle of adjusting to R after using the SAS data step in this way.
The way I emulate output statements in data step in R, is by applying functions along the rows (or in this case, columns) of a data frame.
Define your "array" of column names:
lvl <- c('lvl1','lvl2','lvl3')
Then mimic the SAS output statements using a function (with i
playing the same role as the index of your SAS array, data
representing the input data frame, and names
being the entire vector of columns names to iterate over):
outputcriteria <- function(i, data, names) {
data <- data[!is.na(data[,names[i]]),] #exclude rows with NA in the current column
data[,names[-(1:i)]] <- NA #blank out columns after the current column
return (data)
}
Then apply this function to your data frame df
with lapply()
by setting the input vector to 1:length(lvl)
(the index of your vector of column names), and also passing the data
and name
arguments into the function:
l <- lapply(1:length(lvl), outputcriteria, data = df, names = lvl)
The result l
is a list of three data frames (one for each iteration), which you can rbind
into a single data frame:
do.call(rbind, l)
The main difference is that you end up reading the data three times and outputting a data frame each time, as opposed to reading it once in SAS.