Search code examples
rdataframedplyr

Rowwise mean of randomly selected non-NA columns


I have a data frame, where for each row, I would like to randomly sample three columns (which three columns can differ between rows) and take the mean of those three sampled values. As an additional problem, I have many rows which are completely NA (and I cannot delete them due to other reasons) or only contain 1 or 2 non-NA values. Based on this question and answer, I tried the following:

df_new <- df %>%
  rowwise %>%
  mutate(inflo_mean = mean(sample(na.omit(c_across(everything())), 3)))

This doesn't work and I get an error about the use of sample():

Error in `mutate()`:
ℹ In argument: `inflo_mean = mean(sample(na.omit(c_across(everything())), 3))`.
ℹ In row 1.
Caused by error in `sample.int()`:
! invalid first argument

I then tried to break it down into smaller steps and handle the different NA cases seperately and came up with this:

df_new2 <- df %>%
  rowwise() %>%
  mutate(num_NAs = sum(!is.na(across(starts_with("Col_")))),
         v_inflo = list(na.omit(c_across((starts_with("Col_"))))),
         inflo_mean = case_when(num_NAs > 2 ~ mean(sample(v_inflo, 3)),
                                  num_NAs == 2 ~ mean(v_inflo),
                                  num_NAs == 1 ~ as.numeric(v_inflo),
                                  num_NAs == 0 ~ NA_real_,
                                  TRUE ~ NA_real_))

Again, this doesn't work either and I get the same error. I checked the data type of the columns, they are all integers. What could be the problem here? Or is there any other solution?

Example data:

> dput(df)
structure(list(Col_1 = c(NA, 77L, 82L, 172L), Col_2 = c(NA, 79L, 
NA, 135L), Col_3 = c(NA, 81L, NA, 131L), Col_4 = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), Col_5 = c(NA, NA, NA, 
33L), Col_6 = c(NA, NA, NA, 104L), Col_7 = c(NA, NA, NA, 106L
), Col_8 = c(NA, NA, NA, 93L), Col_9 = c(NA, NA, NA, 50L), Col_10 = c(NA, 
NA, NA, 48L), Col_11 = c(NA, NA, NA, 96L), Col_12 = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), Col_13 = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), Col_14 = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), Col_15 = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -4L))

Solution

  • You can use mapply like below

    df$inflo_mean <-
      mapply(
        \(x, k) mean(na.omit(c(x))[sample.int(sum(!is.na(x)), k)]),
        asplit(df, 1),
        pmin(rowSums(!is.na(df)), 3)
      )
    

    and you will see that

    > print(as.data.frame(df))
      Col_1 Col_2 Col_3 Col_4 Col_5 Col_6 Col_7 Col_8 Col_9 Col_10 Col_11 Col_12
    1    NA    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
    2    77    79    81    NA    NA    NA    NA    NA    NA     NA     NA     NA
    3    82    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA
    4   172   135   131    NA    33   104   106    93    50     48     96     NA
      Col_13 Col_14 Col_15 inflo_mean
    1     NA     NA     NA        NaN
    2     NA     NA     NA   79.00000
    3     NA     NA     NA   82.00000
    4     NA     NA     NA   85.66667