Suppose I have the following data.frame
:
dt = tibble::tibble(
id_0 = rep(123, 6),
name_0 = rep("A", 6),
id_1 = c(rep(321, 3), rep(322, 3)),
name_1 = c(rep("B", 3), rep("C", 3)),
p_1 = c(rep(0.7, 3), rep(0.3, 3)),
id_2 = c(NA, 323:326, NA),
name_2 = c(NA, "D", "E", "J", "G", NA),
p_2 = c(NA, 0.8, 0.2, 0.9, 0.1, NA),
id_3 = c(NA, NA, 323, NA, NA, NA),
na_3 = c(NA, NA, "H", NA, NA, NA),
p_3 = c(NA, NA, 1, NA, NA, NA),
)
Which looks like this:
# A tibble: 6 x 11
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr>
1 123 A 321 B 0.7 NA NA NA NA NA
2 123 A 321 B 0.7 323 D 0.8 NA NA
3 123 A 321 B 0.7 324 E 0.2 323 H
4 123 A 322 C 0.3 325 J 0.9 NA NA
5 123 A 322 C 0.3 326 G 0.1 NA NA
6 123 A 322 C 0.3 NA NA NA NA NA
I need to take the row-by-row product from al the p_*
columns. In this case it would be Product = p_1 * p_2 * p_3
, but in general it could be any product from p_1 to p_* (this data.frame
changes from case to case, I mean Product = product(p_1, p_2, ..., p_n)
). Note that p_*
is always greater than cero and less or equal than one (p_ > 0 & p_ <= 1). So the task I need to accomplish have two things: The Product
1) must omit the NAs
and 2) be general for any number of p_*
present in the data.frame
.
The desireble output should look like this:
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 J 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
I suggest a reshaping operation:
library(dplyr)
library(tidyr) # pivot_longer
# preserve a row-wise "id"
dt <- mutate(dt, rn = row_number())
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num"))
# # A tibble: 24 x 6
# rn num id name p na
# <int> <chr> <dbl> <chr> <dbl> <chr>
# 1 1 0 123 A NA <NA>
# 2 1 1 321 B 0.7 <NA>
# 3 1 2 NA <NA> NA <NA>
# 4 1 3 NA <NA> NA <NA>
# 5 2 0 123 A NA <NA>
# 6 2 1 321 B 0.7 <NA>
# 7 2 2 323 D 0.8 <NA>
# 8 2 3 NA <NA> NA <NA>
# 9 3 0 123 A NA <NA>
# 10 3 1 321 B 0.7 <NA>
# # ... with 14 more rows
With this, we can easily group_by
, calculate the product, ...
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE))
# # A tibble: 6 x 2
# rn Product
# <int> <dbl>
# 1 1 0.7
# 2 2 0.560
# 3 3 0.140
# 4 4 0.27
# 5 5 0.03
# 6 6 0.3
... and then join it back in to dt
.
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE)) %>%
left_join(dt, ., by = "rn") %>%
select(-rn)
# # A tibble: 6 x 12
# id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
# <dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
# 1 123 A 321 B 0.7 NA <NA> NA NA <NA> NA 0.7
# 2 123 A 321 B 0.7 323 D 0.8 NA <NA> NA 0.560
# 3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.140
# 4 123 A 322 C 0.3 325 F 0.9 NA <NA> NA 0.27
# 5 123 A 322 C 0.3 326 G 0.1 NA <NA> NA 0.03
# 6 123 A 322 C 0.3 NA <NA> NA NA <NA> NA 0.3
(Side note: based on your comment of "any number of p_*
", it might make sense to keep your data in the longer format (out of pivot_longer
) for more of your processing.)