Counting start and end states with a pivot table in R


I have a table of data with studentIDs, TestingWindow, and BenchmarkCategories. Values for TestingWindow are "Start of Year" or "End of Year". Values for BenchmarkCategories are Urgent Intervention, Intervention, On Watch, At Benchmark, and Above Benchmark.

Here's some sample data. Notice that student 123 has no End of Year result.

studentIDs  ScreeningPeriodWindowName <chr>  DistrictBenchmarkCategoryName <chr>
123         StartofYear                      Urgent Intervention
456         StartofYear                      Intervention
456         EndofYear                        At Benchmark
789         StartofYear                      On Watch
789         EndofYear                        Above Benchmark

I want to display the movement of students between BenchmarkCategories from the start to the end of the year, and I'd like a table to show this. The key is to show changes for individual students rather than total counts per category. For example: "How many students moved from Urgent Intervention to On Watch?"

Here's what the table should look like, with the start of year categories in Column A and the end of year categories in Row 1. (The middle cells are left blank, but I think you can see the pattern.)

v start  end >       Urgent Intervention        Intervention                     On Watch                     At Benchmark                     Above Benchmark
Urgent Intervention  count of Urgent to Urgent  count of Urgent to Intervention  count of Urgent to On Watch  count of Urgent to At Benchmark  count of Urgent to Above Benchmark
Intervention
On Watch
At Benchmark
Above Benchmark      count of Above to Urgent   count of Above to Intervention   count of Above to On Watch   count of Above to At Benchmark   count of Above to Above

In Excel pivot tables this is trivial (defining rows and columns, count IDs) but I am having a hard time wrapping my head around it in terms of grouping and summarizing (or pivot-tabling) in R. Thanks.

I have tried pivot_wider and pivot_longer, but each only gets me the row or column headings and one set of counts.

progress <- reading_data %>%
  select(studentID,TestingWindow,BenchmarkCategories) %>%
  group_by(TestingWindow,BenchmarkCategories) %>%
  summarize(count=n_distinct(studentID)) %>%
  pivot_wider(names_from=TestingWindow,values_from=count)

progress

Solution

  • You can do this using two pivot_wider() calls:

    • first, pivot_wider() to create a separate column for each value in TestingWindow, i.e. StartofYear and EndofYear
    • then count each StartofYear/EndofYear pairing and pivot_wider() again so that the EndofYear values become columns

    Note that if you want all BenchmarkCategories to be included in the result, you'll need an extra step as detailed below. I have also included a larger example dataset to illustrate results.

    First, with your sample data (column names changed to match those used in your pipeline):

    library(dplyr)
    library(tidyr)
    
    reading_data <- structure(list(studentID = c(123L, 456L, 456L, 789L, 789L), TestingWindow = c("StartofYear", 
    "StartofYear", "EndofYear", "StartofYear", "EndofYear"), BenchmarkCategories = c("Urgent Intervention", 
    "Intervention", "At Benchmark", "On Watch", "Above Benchmark"
    )), class = "data.frame", row.names = c(NA, -5L))
    
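    progress <- reading_data |>
      # one row per student, one column per testing window
      pivot_wider(names_from = TestingWindow,
                  values_from = BenchmarkCategories) |>
      # count students per start/end pairing
      # (.drop = FALSE only affects factor columns; these are character)
      count(StartofYear, EndofYear, .drop = FALSE) |>
      # spread the end-of-year categories across columns
      pivot_wider(names_from = EndofYear,
                  values_from = n,
                  values_fill = 0) |>
      rename(`v start end >` = StartofYear)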
    
    progress
    # # A tibble: 3 × 4
    #   `v start end >`     `At Benchmark` `Above Benchmark`  `NA`
    #   <chr>                        <int>             <int> <int>
    # 1 Intervention                     1                 0     0
    # 2 On Watch                         0                 1     0
    # 3 Urgent Intervention              0                 0     1
    

    Start/end pairs that do not occur in the data are dropped, so not all BenchmarkCategories appear in the result. The NA column counts students with incomplete records, e.g. studentID == 123, who has no EndofYear result.
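To see where the NA column comes from, it can help to look at the intermediate result of the first pivot_wider() on its own. The following is a minimal sketch using the question's sample data; the object name `wide` is only illustrative.

```r
library(tidyr)

# Sample data from the question (column names as used in the answer)
reading_data <- data.frame(
  studentID = c(123L, 456L, 456L, 789L, 789L),
  TestingWindow = c("StartofYear", "StartofYear", "EndofYear",
                    "StartofYear", "EndofYear"),
  BenchmarkCategories = c("Urgent Intervention", "Intervention",
                          "At Benchmark", "On Watch", "Above Benchmark")
)

# Step 1 only: one row per student, one column per testing window
wide <- pivot_wider(reading_data,
                    names_from = TestingWindow,
                    values_from = BenchmarkCategories)

# Student 123 has no EndofYear record, so that cell is NA
wide
```

Each row of `wide` is one student's start/end pairing, which is what the later count() step tallies. The missing EndofYear value for student 123 is what ends up in the NA column.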

    If you want all BenchmarkCategories in the result:

    # Create vector of all BenchmarkCategories that occur in the data
    # (a category that never occurs would have to be listed explicitly)
    bmc <- unique(reading_data$BenchmarkCategories)
    
    progress <- reading_data |>
      pivot_wider(names_from = TestingWindow,
                  values_from = BenchmarkCategories) |>
      count(StartofYear, EndofYear, .drop = FALSE) |>
      complete(StartofYear = bmc,
               EndofYear = bmc,
               fill = list(n = 0)) |>
      pivot_wider(names_from = EndofYear,
                  values_from = n,
                  values_fill = 0) |>
      rename(`v start end >` = StartofYear)
      
    
    progress
    # # A tibble: 5 × 7
    #   `v start end >`     `Above Benchmark` `At Benchmark` Intervention `On Watch` `Urgent Intervention`  `NA`
    #   <chr>                           <int>          <int>        <int>      <int>                 <int> <int>
    # 1 Above Benchmark                     0              0            0          0                     0     0
    # 2 At Benchmark                        0              0            0          0                     0     0
    # 3 Intervention                        0              1            0          0                     0     0
    # 4 On Watch                            1              0            0          0                     0     0
    # 5 Urgent Intervention                 0              0            0          0                     0     1
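If the rows and columns should follow the category order given in the question rather than alphabetical order, one possible variation is to convert both columns to factors with explicit levels. With factor columns, count(.drop = FALSE) keeps the empty pairings itself, so complete() is not required. This is a sketch under assumptions: the names `bmc_levels` and `progress_ordered` are illustrative, the level order is taken from the question, and students without both results are dropped here rather than counted in an NA column.

```r
library(dplyr)
library(tidyr)

reading_data <- data.frame(
  studentID = c(123L, 456L, 456L, 789L, 789L),
  TestingWindow = c("StartofYear", "StartofYear", "EndofYear",
                    "StartofYear", "EndofYear"),
  BenchmarkCategories = c("Urgent Intervention", "Intervention",
                          "At Benchmark", "On Watch", "Above Benchmark")
)

# Assumed category order, lowest to highest, as listed in the question
bmc_levels <- c("Urgent Intervention", "Intervention", "On Watch",
                "At Benchmark", "Above Benchmark")

progress_ordered <- reading_data |>
  pivot_wider(names_from = TestingWindow,
              values_from = BenchmarkCategories) |>
  # keep only students who have both a start and an end result
  drop_na(StartofYear, EndofYear) |>
  # factors with explicit levels fix the order and define the full grid
  mutate(across(c(StartofYear, EndofYear),
                \(x) factor(x, levels = bmc_levels))) |>
  # with factor columns, .drop = FALSE keeps pairings that have zero students
  count(StartofYear, EndofYear, .drop = FALSE) |>
  pivot_wider(names_from = EndofYear, values_from = n)

progress_ordered
```

The result should have one row per start category and one column per end category, both in the order of `bmc_levels`, with zeros for pairings that do not occur.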
    

    Example using a larger dataset:

    set.seed(42)
    reading_data <- data.frame(
      studentID = c(123, rep(456:789, each = 2)),
      TestingWindow = c("StartofYear", rep(c("StartofYear", "EndofYear"), 334)),
      BenchmarkCategories = sample(c("Urgent Intervention","Intervention",
                                     "At Benchmark", "On Watch", "Above Benchmark"),
                                   669, replace = TRUE))
    
    progress <- reading_data |>
      pivot_wider(names_from = TestingWindow,
                  values_from = BenchmarkCategories) |>
      count(StartofYear, EndofYear, .drop = FALSE) |>
      # complete() is not needed here because every start/end pairing occurs in this data
      # complete(StartofYear = bmc,
      #          EndofYear = bmc,
      #          fill = list(n = 0)) |>
      pivot_wider(names_from = EndofYear,
                  values_from = n,
                  values_fill = 0) |>
      rename(`v start end >` = StartofYear)
    
    progress
    # # A tibble: 5 × 7
    #   `v start end >`     `Above Benchmark` `At Benchmark` Intervention `On Watch` `Urgent Intervention`  `NA`
    #   <chr>                           <int>          <int>        <int>      <int>                 <int> <int>
    # 1 Above Benchmark                    12              8           13         13                     8     0
    # 2 At Benchmark                       14              9           11         12                    15     0
    # 3 Intervention                       18             17           18         15                    20     0
    # 4 On Watch                           14             10           14          7                    15     0
    # 5 Urgent Intervention                11             21           14         17                     8     1
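For comparison, base R's table() can build the same kind of start-by-end cross-tabulation once the data has one row per student. This is offered as an alternative sketch, not as part of the approach above; the names `wide`, `bmc_levels`, and `movement` are illustrative, and the question's small sample data is used so the block is self-contained.

```r
library(tidyr)

reading_data <- data.frame(
  studentID = c(123L, 456L, 456L, 789L, 789L),
  TestingWindow = c("StartofYear", "StartofYear", "EndofYear",
                    "StartofYear", "EndofYear"),
  BenchmarkCategories = c("Urgent Intervention", "Intervention",
                          "At Benchmark", "On Watch", "Above Benchmark")
)

# Assumed category order, as listed in the question
bmc_levels <- c("Urgent Intervention", "Intervention", "On Watch",
                "At Benchmark", "Above Benchmark")

# One row per student, one column per testing window
wide <- pivot_wider(reading_data,
                    names_from = TestingWindow,
                    values_from = BenchmarkCategories)

# Cross-tabulate start (rows) against end (columns).
# By default table() leaves out records with an NA, so student 123 is not counted.
movement <- table(
  start = factor(wide$StartofYear, levels = bmc_levels),
  end   = factor(wide$EndofYear,   levels = bmc_levels)
)

movement
```

Because both factors carry all five levels, the table is 5 by 5 even for pairings with no students. table() also has a `useNA` argument for cases where incomplete records should be tallied as well.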