Test for set inclusion and processing data simultaneously in tidyverse


I almost have what I need. I need some help with the last detail! The data set is produced by the following:

set.seed(29)  # seed before sampling so the data set is reproducible

stu_vec <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
college_vec <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC", "USC", "Clemson", "Winthrop", "Allen")
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")
Student <- sample(stu_vec, size = 100, replace = TRUE, prob = c(.08, .09, .06, .07, .12, .10, .07, .05, .11, .05))
College <- sample(college_vec, size = 100, replace = TRUE, prob = c(.08, .07, .13, .12, .11, .06, .05, .08, .02, .08))

# One row per enrolment; both columns are character vectors
test.dat1 <- data.frame(Student, College)

I am using the following code to create what I need:

library(dplyr)

set.seed(29)
test.dat2 <- test.dat1 %>%
  group_by(Student, .drop = FALSE) %>%   # group by student
  mutate(semester = sequence(n())) %>%   # set semester sequence
  summarise(
    # find first college in sctcs
    home_school = College[min(which(College %in% sctcs))],
    # add column of sequence values
    seq_home = min(which(College %in% sctcs)),
    # new_school should be the first non-sctcs school after the sctcs school
    # is found, or the last school for that student
    new_school = if_else(n_distinct(College) > 1,
                         first(College[!(College %in% sctcs) & semester > seq_home]),
                         last(College))
  )

It produces the following table:

[image: resulting table, not reproduced]

I want the NAs to be filled in with the last college for that student, but I don't know how to get rid of them. If you know an easier way to produce the same result, please share it.


Solution

  • This ought to do it:

    test.dat2 <- test.dat1 |>
      group_by(Student, .drop = FALSE) |>
      # Number the semesters within each student, in original row order
      mutate(semester = row_number()) |>
      arrange(Student, semester) |> # find this a more intuitive order
      summarise(
        # Position of the first sctcs college; first() gives NA if there is none
        # (min() on an empty which() would give Inf with a warning)
        seq_home = first(which(College %in% sctcs)),
        home_school = College[seq_home],
        new_school =
          College[
            coalesce(
              first(which(!(College %in% sctcs) & semester > seq_home)),
              seq_home,
              length(College))
          ]
      )
    

    We're indexing College with coalesce(), which returns the first non-missing value from its arguments. First, we look for the first non-sctcs college the student attended after home_school. If there is no such college, we fall back to seq_home, which points at home_school, the first sctcs college they attended. If there is no sctcs college either (as would be the case if they had never attended one), the final fallback is length(College), which indexes the last college they attended.

    I'm still not 100% clear on whether this does exactly what you want; I don't know if you'd considered the case where a student has no sctcs colleges at all. No student falls into that case with this seed, but it could easily happen.
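
To see how the coalesce() fallbacks behave in each case, here is a minimal sketch on made-up data. The students X, Y and Z and the data frame toy are hypothetical and not part of the question's data set. The sketch also assumes first(which(...)) in place of min(which(...)), because first() returns NA on an empty match while min() returns Inf with a warning, and it uses seq_along(College) as the within-student semester counter.

```r
library(dplyr)

sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical toy data: rows are in attendance order within each student
toy <- data.frame(
  Student = c("X", "X", "X", "Y", "Y", "Z", "Z"),
  College = c("USC", "ATC", "Clemson",  # X: sctcs college, then a non-sctcs one
              "GTC", "DTC",             # Y: only sctcs colleges
              "USC", "Allen")           # Z: never attended an sctcs college
)

result <- toy |>
  group_by(Student) |>
  summarise(
    # position of the first sctcs college, NA if there is none
    seq_home = first(which(College %in% sctcs)),
    home_school = College[seq_home],
    new_school = College[coalesce(
      # 1st choice: first non-sctcs college after the home school
      first(which(!(College %in% sctcs) & seq_along(College) > seq_home)),
      # 2nd choice: the home school itself
      seq_home,
      # 3rd choice: the last college attended
      length(College)
    )]
  )

print(result)
# X: home_school "ATC", new_school "Clemson"
# Y: home_school "GTC", new_school "GTC"
# Z: home_school NA,    new_school "Allen"
```

Student Y shows the design choice in the fallback order: with seq_home as the second argument, a student who only attended sctcs colleges gets the home school back ("GTC"). If the last college attended ("DTC") is wanted instead, as the question's wording suggests, dropping seq_home from the coalesce() call should give that.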