Search code examples
rdataframerowdplyr

How to calculate the column product for specific columns matching a pattern and meet some conditions, row-by-row?


Suppose I have the following data.frame:

dt = tibble::tibble(
  id_0 = rep(123, 6),
  name_0 = rep("A", 6),
  id_1 = c(rep(321, 3), rep(322, 3)),
  name_1 = c(rep("B", 3), rep("C", 3)),
  p_1 = c(rep(0.7, 3), rep(0.3, 3)),
  id_2 = c(NA, 323:326, NA),
  name_2 = c(NA, "D", "E", "J", "G", NA),
  p_2 = c(NA, 0.8, 0.2, 0.9, 0.1, NA),
  id_3 = c(NA, NA, 323, NA, NA, NA),
  na_3 = c(NA, NA, "H", NA, NA, NA),
  p_3 = c(NA, NA, 1, NA, NA, NA),
)

Which looks like this:

# A tibble: 6 x 11
   id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3 
  <dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr>
1   123 A        321 B        0.7    NA NA      NA      NA NA   
2   123 A        321 B        0.7   323 D        0.8    NA NA   
3   123 A        321 B        0.7   324 E        0.2   323 H    
4   123 A        322 C        0.3   325 J        0.9    NA NA   
5   123 A        322 C        0.3   326 G        0.1    NA NA   
6   123 A        322 C        0.3    NA NA      NA      NA NA 

I need to take the row-by-row product from al the p_* columns. In this case it would be Product = p_1 * p_2 * p_3, but in general it could be any product from p_1 to p_* (this data.frame changes from case to case, I mean Product = product(p_1, p_2, ..., p_n)). Note that p_* is always greater than cero and less or equal than one (p_ > 0 & p_ <= 1). So the task I need to accomplish have two things: The Product1) must omit the NAs and 2) be general for any number of p_* present in the data.frame.

The desireble output should look like this:

# A tibble: 6 x 12
   id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3    p_3 Product
  <dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr> <dbl>   <dbl>
1   123 A        321 B        0.7    NA NA      NA      NA NA       NA    0.7 
2   123 A        321 B        0.7   323 D        0.8    NA NA       NA    0.56
3   123 A        321 B        0.7   324 E        0.2   323 H         1    0.14
4   123 A        322 C        0.3   325 J        0.9    NA NA       NA    0.27
5   123 A        322 C        0.3   326 G        0.1    NA NA       NA    0.03
6   123 A        322 C        0.3    NA NA      NA      NA NA       NA    0.3 

Solution

  • I suggest a reshaping operation:

    library(dplyr)
    library(tidyr) # pivot_longer
    
    # preserve a row-wise "id"
    dt <- mutate(dt, rn = row_number())
    
    dt %>%
      pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num"))
    # # A tibble: 24 x 6
    #       rn num      id name      p na   
    #    <int> <chr> <dbl> <chr> <dbl> <chr>
    #  1     1 0       123 A      NA   <NA> 
    #  2     1 1       321 B       0.7 <NA> 
    #  3     1 2        NA <NA>   NA   <NA> 
    #  4     1 3        NA <NA>   NA   <NA> 
    #  5     2 0       123 A      NA   <NA> 
    #  6     2 1       321 B       0.7 <NA> 
    #  7     2 2       323 D       0.8 <NA> 
    #  8     2 3        NA <NA>   NA   <NA> 
    #  9     3 0       123 A      NA   <NA> 
    # 10     3 1       321 B       0.7 <NA> 
    # # ... with 14 more rows
    

    With this, we can easily group_by, calculate the product, ...

    dt %>%
      pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
      group_by(rn) %>%
      summarize(Product = prod(p, na.rm = TRUE))
    # # A tibble: 6 x 2
    #      rn Product
    #   <int>   <dbl>
    # 1     1   0.7  
    # 2     2   0.560
    # 3     3   0.140
    # 4     4   0.27 
    # 5     5   0.03 
    # 6     6   0.3  
    

    ... and then join it back in to dt.

    dt %>%
      pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
      group_by(rn) %>%
      summarize(Product = prod(p, na.rm = TRUE)) %>%
      left_join(dt, ., by = "rn") %>%
      select(-rn)
    # # A tibble: 6 x 12
    #    id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3    p_3 Product
    #   <dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr> <dbl>   <dbl>
    # 1   123 A        321 B        0.7    NA <NA>    NA      NA <NA>     NA   0.7  
    # 2   123 A        321 B        0.7   323 D        0.8    NA <NA>     NA   0.560
    # 3   123 A        321 B        0.7   324 E        0.2   323 H         1   0.140
    # 4   123 A        322 C        0.3   325 F        0.9    NA <NA>     NA   0.27 
    # 5   123 A        322 C        0.3   326 G        0.1    NA <NA>     NA   0.03 
    # 6   123 A        322 C        0.3    NA <NA>    NA      NA <NA>     NA   0.3  
    

    (Side note: based on your comment of "any number of p_*", it might make sense to keep your data in the longer format (out of pivot_longer) for more of your processing.)