I have a data frame, where for each row, I would like to randomly sample three columns (which three columns can differ between rows) and take the mean of those three sampled values. As an additional problem, I have many rows which are completely NA (and I cannot delete them due to other reasons) or only contain 1 or 2 non-NA values. Based on this question and answer, I tried the following:
df_new <- df %>%
rowwise %>%
mutate(inflo_mean = mean(sample(na.omit(c_across(everything())), 3)))
This doesn't work and I get an error about the use of sample()
:
Error in `mutate()`:
ℹ In argument: `inflo_mean = mean(sample(na.omit(c_across(everything())), 3))`.
ℹ In row 1.
Caused by error in `sample.int()`:
! invalid first argument
I then tried to break it down into smaller steps and handle the different NA cases seperately and came up with this:
df_new2 <- df %>%
rowwise() %>%
mutate(num_NAs = sum(!is.na(across(starts_with("Col_")))),
v_inflo = list(na.omit(c_across((starts_with("Col_"))))),
inflo_mean = case_when(num_NAs > 2 ~ mean(sample(v_inflo, 3)),
num_NAs == 2 ~ mean(v_inflo),
num_NAs == 1 ~ as.numeric(v_inflo),
num_NAs == 0 ~ NA_real_,
TRUE ~ NA_real_))
Again, this doesn't work either and I get the same error. I checked the data type of the columns, they are all integers. What could be the problem here? Or is there any other solution?
Example data:
> dput(df)
structure(list(Col_1 = c(NA, 77L, 82L, 172L), Col_2 = c(NA, 79L,
NA, 135L), Col_3 = c(NA, 81L, NA, 131L), Col_4 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_5 = c(NA, NA, NA,
33L), Col_6 = c(NA, NA, NA, 104L), Col_7 = c(NA, NA, NA, 106L
), Col_8 = c(NA, NA, NA, 93L), Col_9 = c(NA, NA, NA, 50L), Col_10 = c(NA,
NA, NA, 48L), Col_11 = c(NA, NA, NA, 96L), Col_12 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_13 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_14 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_15 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
You can use mapply
like below
df$inflo_mean <-
mapply(
\(x, k) mean(na.omit(c(x))[sample.int(sum(!is.na(x)), k)]),
asplit(df, 1),
pmin(rowSums(!is.na(df)), 3)
)
and you will see that
> print(as.data.frame(df))
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6 Col_7 Col_8 Col_9 Col_10 Col_11 Col_12
1 NA NA NA NA NA NA NA NA NA NA NA NA
2 77 79 81 NA NA NA NA NA NA NA NA NA
3 82 NA NA NA NA NA NA NA NA NA NA NA
4 172 135 131 NA 33 104 106 93 50 48 96 NA
Col_13 Col_14 Col_15 inflo_mean
1 NA NA NA NaN
2 NA NA NA 79.00000
3 NA NA NA 82.00000
4 NA NA NA 85.66667