Search code examples
rdplyrrolling-computation

Count a categorical variable based on a group over time (the date column)


Say I have the following data:

date name rolename
2009-12-01 John helper
2010-12-01 John helper
2011-12-01 John senior helper
2012-12-01 John manager
2009-12-01 Will helper
2010-12-01 Will senior helper
2011-12-01 Will manager
2012-12-01 Will senior manager

I am trying to count the number of roles, based on the column rolename for the person in the name column, the person has worked to date. For example, for the above data, I want a fourth column which measures the number of positions a person has worked so far:

date name rolename nopositions
2009-12-01 John helper 1
2010-12-01 John helper 1
2011-12-01 John senior helper 2
2012-12-01 John manager 3
2009-12-01 Will helper 1
2010-12-01 Will senior helper 2
2011-12-01 Will manager 3
2012-12-01 Will senior manager 4

My failed attempts:

#attempt 1
library(dplyr)

data %>%
group_by(name) %>%
mutate(nopositions = count(rolename))

#attempt2
library(runner)

data %>%
group_by(name) %>%
mutate(nopositions = runner(x = rolename,
                            k = inf,
                            idx = date,
                            f = function(x) length(x))

Solution

  • Assuming that the order by date is assured,

    library(dplyr)
    quux %>%
      group_by(name) %>%
      mutate(noposition = cummax(match(rolename, unique(rolename)))) %>%
      ungroup()
    # # A tibble: 8 × 4
    #   date       name  rolename       noposition
    #   <chr>      <chr> <chr>               <int>
    # 1 2009-12-01 John  helper                  1
    # 2 2010-12-01 John  helper                  1
    # 3 2011-12-01 John  senior helper           2
    # 4 2012-12-01 John  manager                 3
    # 5 2009-12-01 Will  helper                  1
    # 6 2010-12-01 Will  senior helper           2
    # 7 2011-12-01 Will  manager                 3
    # 8 2012-12-01 Will  senior manager          4
    

    We might get away without cummax, except that if a name returns to a previous rolename, its noposition will decreases (revert to a previous value). However, we want to keep the most-recent max value.

    This does assume that unique preserves the natural order of the first-occurrences. If something goes amiss with this (I cannot think of something off-hand), we could do a window of words:

    quux %>%
      group_by(name) %>%
      mutate(noposition = sapply(seq_along(rolename), \(i) length(unique(rolename[1:i])))) %>%
      ungroup()
    # # A tibble: 8 × 4
    #   date       name  rolename       noposition
    #   <chr>      <chr> <chr>               <int>
    # 1 2009-12-01 John  helper                  1
    # 2 2010-12-01 John  helper                  1
    # 3 2011-12-01 John  senior helper           2
    # 4 2012-12-01 John  manager                 3
    # 5 2009-12-01 Will  helper                  1
    # 6 2010-12-01 Will  senior helper           2
    # 7 2011-12-01 Will  manager                 3
    # 8 2012-12-01 Will  senior manager          4
    

    This produces the same results here, and it will tend to perform more poorly with larger groups (as it is iterating a lot more). I offer it as an extension in case assumptions preclude the use of cummax(match(..)).