Search code examples
for-loopif-statementconditional-statementsapplymutate

making four new columns based on 8 existing columns


Below you can see the reproduced sample of my data.

DATA <- structure(list(ID = c("101", "101", "101", "101", "101", "101","101", "101", "101", "101"), IDA = c("1", "1", "2", "3", "4","5", "5", "1859", "1860", "1861"), DATE = structure(c(1300928400,1277946000, 1277946000, 1278550800, 1278550800, 1453770000, 1329958800,1506474000, 1485133200, 1485133200), tzone = "UTC", class = c("POSIXct","POSIXt")), NR = c("CH-0001", "CH-0001","CH-0002", "CH-0003", "CH-0004", "CH-0005","CH-0005", "CH-1859", "CH-1860", "CH-1861"), PAT = c("101-1", "101-1", "101-2", "101-3", "101-4", "101-5","101-5", "101-1859", "101-1860", "101-1861"), INT1 = c(245005,280040, 280040, 280040, 280040, 240040, 240040, NA, NA, NA),INT2 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), INT3 = c(NA_real_,NA_real_, 280010, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, 245035, NA_real_), INT4 = c(NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_), INTX1 = c(NA_real_, 275040, NA_real_,NA_real_, NA_real_, NA_real_, 240080, NA_real_, NA_real_,NA_real_), INTX2 = c(276790, NA_real_, 7612645, NA_real_,NA_real_, NA_real_, 5078219, NA_real_, NA_real_, NA_real_), INTX173 = c(NA_real_, NA_real_, NA_real_, 3456878,NA_real_, NA_real_, 3289778, NA_real_, NA_real_, NA_real_), INTX4 = c(NA_real_, NA_real_, 11198767, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 7025676), KAT = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))

As you see, I have eight columns called: INT1:INT4 and INTX1:INTX4. For each row there are only a maximum of four values for these variables and the rest are NAs. I need to create four new variables called ING1:ING4 and tell R to check the 8 columns one by one per row and assign the first value it finds in that row to ING1, the second value to ING2, the third value to ING3, and the fourth value to ING4.At the end, it is possible that, for a row, all or some of the ING1:ING4 columns are filled with values. I would expect for row 1 I get the following ING columns:

ING1 == 245005, ING2 == 276790, ING3 == NA, ING4 ==NA

I think I need to write a loop for that but as I am a beginner I am lost how to do it. Could you kindly help me with it?


Solution

  • Try this:

    fun <- function(select, prefix = "ING", ncol = -1, data = cur_data()) {
      select <- substitute(select)
      out <- asplit(t(
        apply(subset(data, select = eval(select)), 1, function(z) z[order(is.na(z))])
        ), 2)
      names(out) <- paste0(prefix, seq_along(out))
      if (ncol > 0) out <- out[seq_len(ncol)]
      do.call(data.frame, out)
    }
    

    Parts:

    • apply(.., MARGIN=1, fun) will call the function fun across each row (MARGIN=1) of the data.
    • when applying a function across rows, apply returns a seemingly transposed matrix, so we transpose it back to the dims we expect
    • ultimately we need the returned value to be a list (one column/vector per element), so we take the matrix from t(apply(...)) and asplit it by column (okay, so perhaps I could have removed the t and just asplit on rows ... *shrug*)
    • the substitute(select) and eval(select) portion are to facilitate the "tidy-select" nature of calling it as we do, e.g., INT1:INT4
    • out[seq_len(ncol)] filters the results to the first four columns as set by the ncol= argument (we use ncol=4)

    And its use:

    dplyr

    library(dplyr)
    DATA %>%
      mutate(fun(INT1:INTX4, ncol=4))
    # # A tibble: 10 × 18
    #    ID    IDA   DATE                    NR      PAT        INT1  INT2   INT3  INT4  INTX1   INTX2 INTX173    INTX4   KAT    ING1    ING2    ING3     ING4
    #    <chr> <chr> <dttm>                  <chr>   <chr>     <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>   <dbl>    <dbl> <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
    #  1 101   1     2011-03-24 01:00:00.000 CH-0001 101-1    245005    NA     NA    NA     NA  276790      NA       NA     0  245005  276790      NA       NA
    #  2 101   1     2010-07-01 01:00:00.000 CH-0001 101-1    280040    NA     NA    NA 275040      NA      NA       NA     0  280040  275040      NA       NA
    #  3 101   2     2010-07-01 01:00:00.000 CH-0002 101-2    280040    NA 280010    NA     NA 7612645      NA 11198767     0  280040  280010 7612645 11198767
    #  4 101   3     2010-07-08 01:00:00.000 CH-0003 101-3    280040    NA     NA    NA     NA      NA 3456878       NA     0  280040 3456878      NA       NA
    #  5 101   4     2010-07-08 01:00:00.000 CH-0004 101-4    280040    NA     NA    NA     NA      NA      NA       NA     0  280040      NA      NA       NA
    #  6 101   5     2016-01-26 01:00:00.000 CH-0005 101-5    240040    NA     NA    NA     NA      NA      NA       NA     0  240040      NA      NA       NA
    #  7 101   5     2012-02-23 01:00:00.000 CH-0005 101-5    240040    NA     NA    NA 240080 5078219 3289778       NA     0  240040  240080 5078219  3289778
    #  8 101   1859  2017-09-27 01:00:00.000 CH-1859 101-1859     NA    NA     NA    NA     NA      NA      NA       NA     1      NA      NA      NA       NA
    #  9 101   1860  2017-01-23 01:00:00.000 CH-1860 101-1860     NA    NA 245035    NA     NA      NA      NA       NA     1  245035      NA      NA       NA
    # 10 101   1861  2017-01-23 01:00:00.000 CH-1861 101-1861     NA    NA     NA    NA     NA      NA      NA  7025676     1 7025676      NA      NA       NA
    

    base R

    cbind(DATA, fun(data = DATA, INT1:INTX4, ncol=4))
    #     ID  IDA                DATE      NR      PAT   INT1 INT2   INT3 INT4  INTX1   INTX2 INTX173    INTX4 KAT    ING1    ING2    ING3     ING4
    # 1  101    1 2011-03-24 01:00:00 CH-0001    101-1 245005   NA     NA   NA     NA  276790      NA       NA   0  245005  276790      NA       NA
    # 2  101    1 2010-07-01 01:00:00 CH-0001    101-1 280040   NA     NA   NA 275040      NA      NA       NA   0  280040  275040      NA       NA
    # 3  101    2 2010-07-01 01:00:00 CH-0002    101-2 280040   NA 280010   NA     NA 7612645      NA 11198767   0  280040  280010 7612645 11198767
    # 4  101    3 2010-07-08 01:00:00 CH-0003    101-3 280040   NA     NA   NA     NA      NA 3456878       NA   0  280040 3456878      NA       NA
    # 5  101    4 2010-07-08 01:00:00 CH-0004    101-4 280040   NA     NA   NA     NA      NA      NA       NA   0  280040      NA      NA       NA
    # 6  101    5 2016-01-26 01:00:00 CH-0005    101-5 240040   NA     NA   NA     NA      NA      NA       NA   0  240040      NA      NA       NA
    # 7  101    5 2012-02-23 01:00:00 CH-0005    101-5 240040   NA     NA   NA 240080 5078219 3289778       NA   0  240040  240080 5078219  3289778
    # 8  101 1859 2017-09-27 01:00:00 CH-1859 101-1859     NA   NA     NA   NA     NA      NA      NA       NA   1      NA      NA      NA       NA
    # 9  101 1860 2017-01-23 01:00:00 CH-1860 101-1860     NA   NA 245035   NA     NA      NA      NA       NA   1  245035      NA      NA       NA
    # 10 101 1861 2017-01-23 01:00:00 CH-1861 101-1861     NA   NA     NA   NA     NA      NA      NA  7025676   1 7025676      NA      NA       NA
    

    Edited to only sort on NA-ness, retaining the original order of non-null values.