r, dplyr, tidyr

Create a new variable as a sequence of two existing variables


I would like to import an existing format catalog (the numeric formats, to start with) from SAS into R. In SAS, a format is defined by ranges, each with a start value and an end value. Every integer value from the start value to the end value, inclusive, should be mapped to the target value (which can be a number or a character string).

To use this in R, I have to create a new variable that contains, for each row, the integer sequence running from the start value to the end value.

An example dataset looks like this:

fname <- c("aformat", "aformat", "aformat", "aformat", "aformat")
svalue <- c(1, 10, 15, 16, 30)
evalue <- c(2, 13, 15, 16, 35)
gvalue <- c("Berne", "Lucerne", "Zurich", "Chur", "Basel")

df <- data.frame(fname, svalue, evalue, gvalue)

I tried to solve the problem with the tidyr functions complete() and fill(), but this also fills the gaps between the ranges (for example the value 14, which lies between the ranges 10:13 and 15:15 and should not get a label).

Failed attempt with complete() and fill():

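library(dplyr)
library(tidyr)

# Completes evalue over min(svalue):max(evalue), then fills labels upwards
output <- df %>%
  group_by(fname) %>%
  complete(evalue = full_seq(min(svalue):max(evalue), 1)) %>%
  select(fname, evalue, gvalue) %>%
  fill(gvalue, .direction = "up")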
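
A quick check makes the problem visible. This is only a sketch that rebuilds the example data and reruns the attempt above; the final `ungroup()` and the two inspection lines are additions for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  fname  = rep("aformat", 5),
  svalue = c(1, 10, 15, 16, 30),
  evalue = c(2, 13, 15, 16, 35),
  gvalue = c("Berne", "Lucerne", "Zurich", "Chur", "Basel"),
  stringsAsFactors = FALSE
)

output <- df %>%
  group_by(fname) %>%
  complete(evalue = full_seq(min(svalue):max(evalue), 1)) %>%
  select(fname, evalue, gvalue) %>%
  fill(gvalue, .direction = "up") %>%
  ungroup()

# Every integer from 1 to 35 gets a row, not only those inside a range.
nrow(output)

# 14 lies in no range, yet it inherits the label of the next range (15:15).
filter(output, evalue == 14)
```

The wanted result has only 14 rows (2 + 4 + 1 + 1 + 6), so completing over the full span and filling cannot work here; each range has to be expanded on its own.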

Solution

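  • You can use reframe() (available from dplyr 1.1.0) to "expand" rows of a data frame. Grouping with .by = c(fname, gvalue) works here because each combination of fname and gvalue identifies exactly one row, so seq() receives a single start and a single end value per group.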

    library(dplyr)
    
    reframe(df, evalue = seq(svalue, evalue), .by = c(fname, gvalue))
    
         fname  gvalue evalue
    1  aformat   Berne      1
    2  aformat   Berne      2
    3  aformat Lucerne     10
    4  aformat Lucerne     11
    5  aformat Lucerne     12
    6  aformat Lucerne     13
    7  aformat  Zurich     15
    8  aformat    Chur     16
    9  aformat   Basel     30
    10 aformat   Basel     31
    11 aformat   Basel     32
    12 aformat   Basel     33
    13 aformat   Basel     34
    14 aformat   Basel     35
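
If the same label can occur in more than one range of a format, grouping by the label no longer gives one row per group. A possible variant, shown here only as a sketch, builds one sequence per row with `Map()` and unnests it with tidyr. The column name `value` and the input vector `x` are made up for this illustration. The last step shows how the expanded table could be applied to data with `match()`, roughly in the way a SAS format would be applied.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  fname  = rep("aformat", 5),
  svalue = c(1, 10, 15, 16, 30),
  evalue = c(2, 13, 15, 16, 35),
  gvalue = c("Berne", "Lucerne", "Zurich", "Chur", "Basel"),
  stringsAsFactors = FALSE
)

# One sequence per row, stored as a list column and then unnested.
# This does not require the labels to be unique within a format.
lookup <- df %>%
  mutate(value = Map(seq, svalue, evalue)) %>%
  unnest(value) %>%
  select(fname, value, gvalue)

# Apply the lookup to some values (hypothetical input).
# Values that fall in no range, such as 14, come back as NA.
x <- c(2, 11, 14, 31)
labels <- lookup$gvalue[match(x, lookup$value)]
labels
```

With several formats in the catalog, the lookup would first have to be filtered to the relevant `fname` (or joined on both `fname` and `value`), since the same integer can appear in more than one format.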