Search code examples
rdataframeconditional-statementsplyrcompletion

How to fill column with another on conditions applied on both columns


I am trying to perform a very simple data completion : I have two columns of the same measurement made in two different locations a few miles away from each other. Location 1 is more complete than location 2, and I want to complete the second with the first, by applying a coefficient (loc1/loc2) to 1.

My problem is that it is about weather values, that vary with the day considered, so applying the same mean coefficient to all my values would not be ideal. I first got mean coefficients by the values of loc1 and the day, but for the days with no loc2 values, I would like to :

  • replace NA in 'coef' value by a coef calculated with a same loc1 value
  • but there are sometime more than one coef for each loc1 value, so when there is more than one, it would be preferentially the one measured on the same day
  • and when there is no coef value at all for one loc1 value, then coef would be the same than for that loc1 value +/- an interval (the coef of the closest loc1 value on the closest date to the one to be completed) Here is what it looks like :

      loc1 Date      Coef   loc2
    1      12 204        3      4
    2      8  147        4      2
    3      8  204        NA    NA
    4      10 147        NA    NA
    5      10 158        NA    NA
    6      6  159        3      2
    7      6  162        NA    NA
    8      6  170        2      3
    9      .3 175 0.4833333 0.145
    10     0.3 204        NA    NA
    11     0.4 146        NA    NA
    12     0.4 147        NA    NA
    

I am pretty sure there is a non messy way of doing that, but with my limited knowledge using ifelse or ddply, I didn't get any closer to what I want in the end. I feel a loop could do the trick, but I have no idea how..

Any thoughts would be greatly appreciated ! many thanks in advance !


Solution

  • I don't understand exactly what you mean in bullet 3:

    when there is no coef value at all for one loc1 value, then coef would be the same than for that loc1 value +/- an interval (the coef of the closest loc1 value on the closest date to the one to be completed)

    So I don't have an answer for that part, but this might get you some of the way there:


    library(dplyr)
    
    df <- tibble::tribble(
      ~loc1, ~Date,     ~Coef, ~loc2,
      12,  204L,         3,     4,
      8,  147L,         4,     2,
      8,  204L,        NA,    NA,
      10,  147L,        NA,    NA,
      10,  158L,        NA,    NA,
      6,  159L,         3,     2,
      6,  162L,        NA,    NA,
      6,  170L,         2,     3,
      3,  175L, 0.4833333, 0.145,
      0.3,  204L,        NA,    NA,
      0.4,  146L,        NA,    NA,
      0.4,  147L,        NA,    NA
    )
    
    df %>% 
      # Replace Coef with the coef of same loc1 and same day
      group_by(loc1, Date) %>% 
      mutate(Coef = if_else(!is.finite(Coef), mean(Coef, na.rm = TRUE), Coef)) %>% 
      # For ones without same day and loc1, use the average of all days at loc1
      group_by(loc1) %>% 
      mutate(Coef = if_else(!is.finite(Coef), mean(Coef, na.rm = TRUE), Coef)) %>% 
      ungroup() %>% 
      # Then complete the loc2 with using the completed Coef and loc1
      mutate(loc2 = if_else(!is.finite(loc2), loc1 * Coef, loc2))
    
    #> # A tibble: 12 x 4
    #>     loc1  Date      Coef   loc2
    #>    <dbl> <int>     <dbl>  <dbl>
    #>  1  12.0   204 3.0000000  4.000
    #>  2   8.0   147 4.0000000  2.000
    #>  3   8.0   204 4.0000000 32.000
    #>  4  10.0   147       NaN    NaN
    #>  5  10.0   158       NaN    NaN
    #>  6   6.0   159 3.0000000  2.000
    #>  7   6.0   162 2.5000000 15.000
    #>  8   6.0   170 2.0000000  3.000
    #>  9   3.0   175 0.4833333  0.145
    #> 10   0.3   204       NaN    NaN
    #> 11   0.4   146       NaN    NaN
    #> 12   0.4   147       NaN    NaN