Search code examples
rconditional-statementsgrouping

How to add new columns to a dataframe filled with a value of another cell based on conditions per cluster/group?


I have a dataframe xmpl with the grouping or cluster variable ID. I want to add a new column a1_momo which is filled with one specific value from column a1 for each row per ID based on the following conditions:

'BM' == 1 & 'W2' == 0 & 'dt' == 1

that lead to one clear identified single match per cluster, except for no such row is existing.

xmpl <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,3),
                  BM = c(0,0,1,1,1,0,0,1,1,1),
                  W2 = c(0,0,0,0,1,0,0,0,1,0),
                  dt = c(1,2,1,2,1,1,2,2,1,1),
                  a1 = c(3,2,1,2,2,3,2,2,3,NA))
> xmpl
   ID BM W2 dt a1
1   1  0  0  1  3
2   1  0  0  2  2
3   1  1  0  1  1
4   1  1  0  2  2
5   1  1  1  1  2
6   2  0  0  1  3
7   2  0  0  2  2
8   2  1  0  2  2
9   2  1  1  1  3
10  3  1  0  1 NA

Solely xmpl[3,] fits the conditions for 'ID' == 1 so I want to fill the value of xmpl$a1[3] to each row of the cluster into the new column a1_momo. For 'ID' == 2 there is no match to conditions so therefore a1_momo should be filled with NA for the rows clustered in 'ID' == 2. Lastly, 'ID' == 3 contains a row matching all conditions but has no value for a1.


The result I wish to get:

> xmpl
   ID BM W2 dt a1 a1.momo
1   1  0  0  1  3       1
2   1  0  0  2  2       1
3   1  1  0  1  1       1
4   1  1  0  2  2       1
5   1  1  1  1  2       1
6   2  0  0  1  3      NA
7   2  0  0  2  2      NA
8   2  1  0  2  2      NA
9   2  1  1  1  3      NA
10  3  1  0  1 NA      NA

My limited knowledge in prorgramming with R led me to the following conclusion:

I could produce a new dataframe that is grouped for my cluster variable so I have only one row per ID. The only function I know to do nearly the thing I want, is aggregate but it needs a summary function for specification. Instead of that I need someting to define my conditions by which the new dataframe can be filled with the one specific value (or NA) per cluster, e.g. a ifelse-specification.

So there would be something like:

xmpl.2<-not.aggregate(xmpl[,'a1'],list(xmpl$ID),ifelse(...))

> xmpl.2
  Group.1 a1
1       1  1
2       2 NA
3       3 NA

Afterwards I imagined something like merge-ing it together by the grouping variable ID with the suffix .momo. (because of course the real data contains more than one "a1")

xmpl<-merge(xmpl,xmpl.2,by=1,suffixes=c("",".momo"))

My questions therefore are:

  • Is that one way to go? Which [package::function] do I need to use then instead of aggregate?
  • How do I fill in a new column with a specific value based on another cell when using ifelse-functions?

Days of research ended up with nothing so I'm typing this example. I guess there's a more elegant solution I can't see. Very grateful for help.


Solution

  • I bet there's a concise one-liner with base R, but here's a simple dplyr approach. I join the data to a version that extracts matching rows, one per ID/a1.mono. I'm not sure how you want to deal with potential situations where there are multiple matching rows within an ID, with different a1 values.

    library(dplyr)
    xmpl |> 
      left_join(
        xmpl |>
          filter(BM == 1 & W2 == 0 & dt == 1 & !is.na(a1)) |>
          mutate(a1.mono = a1) |>
          distinct(ID, a1.mono)
      )