Search code examples
rdplyriterationsummary

R & dplyr: Assigning group-level characteristics to selected group members


I have a large data set divided into many small groups by a grouping variable = grp; all members of a group are contiguous in the order of the larger data set. The members of a group each have an id code (= id) and are numbered sequentially from 1. Within a group, some members meet a logical criterion = is_child. Each member has a variable (momloc) that contains either zero or the ID number of another of another group member (the mother if present).

I wish to assign to each individual in the data set the number of group members who has momloc equal to their ID, and zero if none do. I am trying to do this in dplyr as I have the groups set up there, and I have code that works, but it is a Rube Goldberg contraption of nested ifelse functions that adds two additional columns for intermediate values, one of which contains a vector, goes through the data set three times, and is incredibly slow. There has to be a better way than that. I'm getting tangled in the different syntax for mutate, working on rows, and summary, working on groups.

Below is a simplified data set and desired outcome

grp      <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
id       <- c(1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7)
is_child <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0)
momloc   <- c(0, 0, 2, 2, 0, 0, 0, 3, 2, 2, 2)
data <- tibble(grp, id, is_child, momloc)

desired output:

out = c(0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0)

Solution

  • I propose a solution only using dplyr.

    First, I only keep children (assuming that you only want to count them as your out[6] = 2 instead of 3). Then, I create a frequency table of momloc using count(), and merge this to the original data.

    data %>%
        filter(is_child == 1) %>% # only count for children
        group_by(grp) %>%
        count(momloc) %>%
        right_join(data, by = c("grp" = "grp", "momloc" = "id")) %>%
        rename(
            id = momloc,
            momloc = momloc.y,
            out = n
        ) %>%
        mutate(out = ifelse(is.na(out), 0, out))
    
    #> # A tibble: 11 x 5
    #> # Groups:   grp [2]
    #>      grp    id   out is_child momloc
    #>    <dbl> <dbl> <dbl>    <dbl>  <dbl>
    #>  1     1     1     0        0      0
    #>  2     1     2     2        0      0
    #>  3     1     3     0        1      2
    #>  4     1     4     0        1      2
    #>  5     2     1     0        0      0
    #>  6     2     2     2        0      0
    #>  7     2     3     1        0      0
    #>  8     2     4     0        1      3
    #>  9     2     5     0        1      2
    #> 10     2     6     0        1      2
    #> 11     2     7     0        0      2