Search code examples
rdataframedata-manipulationretain

Hold current value until non-null value occurs


Hi I come from a background in SAS and I am relatively new to R. I am attempting to convert an existing SAS program into equivalent R code

I am unsure how to achieve the equivalent of SAS's "retain" and "by" Behavior in R

I have a dataframe with two columns first column is a date column and the second column is a numeric value.

The numeric column represents a result from lab test. The test is conducted semi-regularly so on some days there will be Null values in the data. The data is ordered by date and the dates are sequential.

i.e example data looks like this

Date        Result
2017/01/01      15
2017/01/02      NA
2017/01/03      NA
2017/01/04      12
2017/01/05      NA
2017/01/06      13
2017/01/07      11
2017/01/08      NA

I would like to create a third column which would contain the most recent result. If Result column is Null it should be set to most recent previously non Null Result otherwise it should contain the Result value

My desired output would look like this:

Date        Result  My_var
2017/01/01      15  15
2017/01/02      NA  15
2017/01/03      NA  15
2017/01/04      12  12
2017/01/05      NA  12
2017/01/06      13  13
2017/01/07      11  11
2017/01/08      NA  11

In SAS I can achieve this with something like following code snippet:

data my_data;
    retain My_var;
    set input_data;
    by date;

    if Result not = . then
        my_var = result;
run;

I am stumped as to how to do this in R I do not think R supports By group processing as in SAS - or at least I don't know how to set that as option.

I have naively tried:

my_data <- mutate(input_data, my_var = if(is.na(Result)) {lag(Result)} else {Result})

But I do not think that syntax is correct.


Solution

  • We can use na.locf function from the package to fill in the missing values.

    library(zoo)
    
    dt$My_var <- na.locf(dt$Result)
    dt
    #         Date Result My_var
    # 1 2017/01/01     15     15
    # 2 2017/01/02     NA     15
    # 3 2017/01/03     NA     15
    # 4 2017/01/04     12     12
    # 5 2017/01/05     NA     12
    # 6 2017/01/06     13     13
    # 7 2017/01/07     11     11
    # 8 2017/01/08     NA     11
    

    Or the fill function from the package.

    library(dplyr)
    library(tidyr)
    dt <- dt %>% 
      mutate(My_var = Result) %>%
      fill(My_var)
    dt
    #         Date Result My_var
    # 1 2017/01/01     15     15
    # 2 2017/01/02     NA     15
    # 3 2017/01/03     NA     15
    # 4 2017/01/04     12     12
    # 5 2017/01/05     NA     12
    # 6 2017/01/06     13     13
    # 7 2017/01/07     11     11
    # 8 2017/01/08     NA     11
    

    DATA

    dt <- read.table(text = "Date        Result
    2017/01/01      15
                     2017/01/02      NA
                     2017/01/03      NA
                     2017/01/04      12
                     2017/01/05      NA
                     2017/01/06      13
                     2017/01/07      11
                     2017/01/08      NA",
                     header = TRUE, stringsAsFactors = FALSE)