Search code examples
rbigdatanested-loopsmultiple-databases

Nested loop with using multiple data sets


I am having trouble setting up a nested loop for data cleaning. Each semester (Fall, Spring, Summer) has three data sets that need to be combined over multiple years (2018, 2019, 2020, 2021, 2022). The catch is that I don't need Spring and Summer 2018 because it's part of a different academic year. The data is not available for Summer 2023. I have the two groups of data sets available, Fall 2020, and Spring 2021 [https://drive.google.com/drive/folders/156RqWANAu2wL-Tr-x_wsdZ_qjEgM_5jM?usp=share_link][1]. The goal is to have each semester's data combined and cleaned. I have the two groups of data sets, Fall 2020, and Spring 2021

Semester = c("FA","SP", "SU")
Year = c("18","19","20","21","22")

for(Sem in Semester)
{
  for(Yr in Year){
    
    if((Sem== "SU" & Yr=="2022") | (Sem=="SP" & Yr== "2018")| (Sem=="SU" & Yr=="2018"))
      next
      
Prep_[[Sem,Yr]] <- bind_rows(ASPH_Grad_[[Sem, Yr]], ID_Grad_[[Sem, Yr]]) %>%
  left_join(., PhGrad_[[Sem, Yr]], by= c("ID"="BannerID")) %>% 
  distinct(ID, Program, .keep_all = T) %>% 
  mutate(New_Deg= case_when(is.na(PHGRAD.Degree)== F~ PHGRAD.Degree,
                            is.na(PHGRAD.Degree)== T~ Degree,
                            TRUE~ "Error")) %>%  
  rowwise() %>% 
  mutate(racecount= sum(c_across(`Race-Am Ind`:`Race- Caucasian`)== "Y", na.rm=T)) %>% 
  ungroup() %>% 
  mutate(racecode= case_when(Citizenship %in% list("NN", "NV") ~ "foreign_national",
                             `Race- Hispanic`== "Y" ~ "hispanic_latino", 
                              racecount >1 ~ "two_or_more_races",
                             `Race-Am Ind`== "Y"  ~ "american_indian_alaskan_native",
                             `Race- Asian`== "Y"  ~ "asian",
                             `Race-Afr Amer`== "Y"  ~ "black_african_american",
                             `Race- Hawaiian` == "Y"  ~ "native_hawaiian_pacific_islander",
                             `Race- Caucasian`== "Y" ~ "white",
                             `Race-Not Rept`== "Y" ~ "race_unknown",
                             TRUE~ "race_unknown"),
         gender_long= case_when(Gender== "F"~ "Female",
                                Gender== "M"~ "Male",
                                Gender== "N"~ "Other",
                                TRUE~ "other"),
         DEPT= case_when(Program %in% list("3GPH363AMS", "3GPH363AMSP", "3GPH378AMCD", "3GPH378AMS", "3GPH379APHD")~ "COMD",
                         Program %in% list("3GPH593AMPH", "3GPH593AMS", "3GPH593APHD", "3GPH569ACGS")~ "ENHS",
                         Program %in% list("3GPH596AMS", "3GPH596AMSPH", "3GPH596APHD","3GPH594AMPH", "3GPH594AMS", "3GPH594AMSPH", "3GPH594APHD", "3GPH586APBAC")~ "EPID/BIOS", 
                         Program %in% list("3GPH331AMS","3GPH331APHD","3GPH334AMS","3GPH335ADPT", "3GPH377AMS", "3GPH388AMS", "3GPH588AMPH", "3GPHJ331MS", "3UPH331ABS")~ "EXSC",
                         Program %in% list("3GPH592AMPH", "3GPH592APHD", "3GPH576CGS", "3GPH121CGS", "3GID635CGS")~ "HPEB",
                         Program %in% list("3GPH591AMPH", "3GPH591APHD", "3GPH597AMHA")~ "HSPM",
                         TRUE~ "Missing"), 
         degree_delivery_type= case_when(`First Concentration`== "R999" | `Second Concentration`== "R999" ~ "Distance-based",
                                         `First Concentration`== "3853" | `Second Concentration`== "3853" ~ "Executive", 
                                         TRUE~ "Campus-based"), 
         
         FTE_compute= case_when(Level== "GR" & `Course Hours`<9 ~ `Course Hours`/9,
                                Level== "GR" & `Course Hours`>=9~ 1,
                                Level== "UG" & `Course Hours`<12~ `Course Hours`/12,
                                Level== "UG" & `Course Hours`>=12 ~ 1),
         Full_Part_Status=case_when((Level== "GR" & `Course Hours` <9)| (Level== "UG" & `Course Hours`<12)~"parttime_status",
                             (Level=="GR" & `Course Hours`>=9)|(Level== "UG" & `Course Hours`>=12)~"fulltime_status",
         TRUE~ "other"),
         AcademicYear= paste(Sem,"_",YR),
         StudentCount= 1)
Dat_[[Sem, YR]] 

  }
}


Solution

  • There are a couple of problems with the code. First, a list can only take a single index, so something like X[[Sem, Yr]] won't work because you're passing two indices to the sub-list extractor [[ ]]. Alternatively, you may be trying to make something like Prep_Fa_20 with Prep_[[Sem, Yr]], though this isn't how R works. If you were trying to make that string, you could use paste('Prep", Sem, Yr, sep="_"), but you can't have a paste statement on the left-hand side of the assignment arrow. You could use the assign() function, but I think there's a better way.

    Another problem with the code is that PHGRAD.Degree doesn't exist in the code. I just changed the Degree variable in the PhGrad data frame to PHGRAD.Degree which may or may not be what you intended. Also, the variables to make FTE_compute and Full_Part_Status` do not exist in the data so I just commented those out.

    So, I would first suggest grabbing the file names from the folder. If the names are all as well formatted as the ones you have in the folder you shared, this will have some nice side-effects. You can identify the semester, year and group (ASPH, ID, PhGrad) from the file name. Then you can make a single row per semester-year where the ASPH, ID and PhGrad files are identified.

    library(dplyr)
    library(tidyr)
    setwd("~/Downloads/Stack Overflow/")
    
    d <- tibble(
      filename = list.files(), 
      Sem = gsub(".*(Fall|Spring|Summer).*", "\\1", filename), 
      Year = gsub(".*(\\d{2}).*", "\\1", filename), 
      grp = gsub(".*(ASPH|ID|PhGrad).*", "\\1", filename)
    ) %>% 
      pivot_wider(names_from = "grp", values_from="filename")
    

    Then, you can initialize an output object (res here), make the names of res from the corresponding semester and year elements from d and then loop over the rows of d (which is the same as the length of res). You read in the ASPH, ID and PhGrad files. Then, into the i-th element of the list, you save the merged and re-coded data.

    res <- vector(mode="list", length=nrow(d))
    names(res) <- paste(d$Sem, d$Year, sep="_")
    for(i in seq_along(res)){
      ASPH <- rio::import(d$ASPH[i])
      ID <- rio::import(d$ID[i])
      PhGrad <- rio::import(d$PhGrad[i])
      
      res[[i]] <- bind_rows(ASPH, ID) %>%
        left_join(., PhGrad %>% rename(PHGRAD.Degree = Degree), by= c("ID"="BannerID")) %>% 
        distinct(ID, Program, .keep_all = T) %>% 
        mutate(New_Deg= case_when(is.na(PHGRAD.Degree)== F~ PHGRAD.Degree,
                                  is.na(PHGRAD.Degree)== T~ Degree,
                                  TRUE~ "Error")) %>%  
        rowwise() %>% 
        mutate(racecount= sum(c_across(`Race-Am Ind`:`Race- Caucasian`)== "Y", na.rm=T)) %>% 
        ungroup() %>% 
        mutate(racecode= case_when(Citizenship %in% list("NN", "NV") ~ "foreign_national",
                                   `Race- Hispanic`== "Y" ~ "hispanic_latino", 
                                   racecount >1 ~ "two_or_more_races",
                                   `Race-Am Ind`== "Y"  ~ "american_indian_alaskan_native",
                                   `Race- Asian`== "Y"  ~ "asian",
                                   `Race-Afr Amer`== "Y"  ~ "black_african_american",
                                   `Race- Hawaiian` == "Y"  ~ "native_hawaiian_pacific_islander",
                                   `Race- Caucasian`== "Y" ~ "white",
                                   `Race-Not Rept`== "Y" ~ "race_unknown",
                                   TRUE~ "race_unknown"),
               gender_long= case_when(Gender== "F"~ "Female",
                                      Gender== "M"~ "Male",
                                      Gender== "N"~ "Other",
                                      TRUE~ "other"),
               DEPT= case_when(Program %in% list("3GPH363AMS", "3GPH363AMSP", "3GPH378AMCD", "3GPH378AMS", "3GPH379APHD")~ "COMD",
                               Program %in% list("3GPH593AMPH", "3GPH593AMS", "3GPH593APHD", "3GPH569ACGS")~ "ENHS",
                               Program %in% list("3GPH596AMS", "3GPH596AMSPH", "3GPH596APHD","3GPH594AMPH", "3GPH594AMS", "3GPH594AMSPH", "3GPH594APHD", "3GPH586APBAC")~ "EPID/BIOS", 
                               Program %in% list("3GPH331AMS","3GPH331APHD","3GPH334AMS","3GPH335ADPT", "3GPH377AMS", "3GPH388AMS", "3GPH588AMPH", "3GPHJ331MS", "3UPH331ABS")~ "EXSC",
                               Program %in% list("3GPH592AMPH", "3GPH592APHD", "3GPH576CGS", "3GPH121CGS", "3GID635CGS")~ "HPEB",
                               Program %in% list("3GPH591AMPH", "3GPH591APHD", "3GPH597AMHA")~ "HSPM",
                               TRUE~ "Missing"), 
               degree_delivery_type= case_when(`First Concentration`== "R999" | `Second Concentration`== "R999" ~ "Distance-based",
                                               `First Concentration`== "3853" | `Second Concentration`== "3853" ~ "Executive", 
                                               TRUE~ "Campus-based"), 
               
               # FTE_compute= case_when(Level== "GR" & `Course Hours`<9 ~ `Course Hours`/9,
               #                        Level== "GR" & `Course Hours`>=9~ 1,
               #                        Level== "UG" & `Course Hours`<12~ `Course Hours`/12,
               #                        Level== "UG" & `Course Hours`>=12 ~ 1),
               # Full_Part_Status=case_when((Level== "GR" & `Course Hours` <9)| (Level== "UG" & `Course Hours`<12)~"parttime_status",
               #                            (Level=="GR" & `Course Hours`>=9)|(Level== "UG" & `Course Hours`>=12)~"fulltime_status",
               #                            TRUE~ "other"),
               AcademicYear= paste0(d$Sem[i],"_",d$Year[i]),
               StudentCount= 1)  
    }
    

    You can then pull any element of the list according to its name (e.g., res[["Fall_20"]]. Here's the head of all the list elements.

    lapply(res, head)
    #> $Fall_20
    #> # A tibble: 6 × 37
    #>   ID      Gender Race-…¹ Race-…² Race-…³ Race-…⁴ Race-…⁵ Race-…⁶ Race-…⁷ Citiz…⁸
    #>   <chr>   <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
    #> 1 E30838… F      N       N       N       N       N       Y       N       CV     
    #> 2 E30838… F      N       N       N       N       N       Y       N       CV     
    #> 3 A25905… M      Y       N       N       N       N       Y       N       CV     
    #> 4 A25905… M      Y       N       N       N       N       Y       N       CV     
    #> 5 Y91985… M      N       N       N       N       N       Y       N       CV     
    #> 6 W54850… F      N       N       N       N       N       Y       N       CV     
    #> # … with 27 more variables: Program <chr>, `Program Desc` <chr>, Major <chr>,
    #> #   `Major Desc` <chr>, `First Minor` <lgl>, `First Minor Desc` <lgl>,
    #> #   `Second Minor` <lgl>, `Second Minor Desc` <lgl>,
    #> #   `First Concentration` <chr>, `First Concentration Desc` <chr>,
    #> #   `Second Concentration` <chr>, `Second Concentration Desc` <chr>,
    #> #   Degree <chr>, `Degree Desc` <chr>, Level <chr>, Term <chr>,
    #> #   PHGRAD.Degree <chr>, MatriculationTerm <chr>, MatriculationYear <int>, …
    #> 
    #> $Spring_21
    #> # A tibble: 6 × 39
    #>   ID      Gender Race-…¹ Race-…² Race-…³ Race-…⁴ Race-…⁵ Race-…⁶ Race-…⁷ Citiz…⁸
    #>   <chr>   <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
    #> 1 S68407… F      N       N       N       N       N       Y       N       CV     
    #> 2 T18720… F      N       N       N       N       N       Y       N       CV     
    #> 3 A80693… F      N       N       N       Y       N       N       N       CV     
    #> 4 X94989… F      N       N       N       N       N       Y       N       CV     
    #> 5 W82342… F      N       N       N       Y       N       N       N       CV     
    #> 6 N22004… M      N       N       N       Y       N       N       N       NV     
    #> # … with 29 more variables: Program <chr>, `Program Desc` <chr>, Major <chr>,
    #> #   `Major Desc` <chr>, `First Minor` <lgl>, `First Minor Desc` <lgl>,
    #> #   `Second Minor` <lgl>, `Second Minor Desc` <lgl>,
    #> #   `First Concentration` <chr>, `First Concentration Desc` <chr>,
    #> #   `Second Concentration` <chr>, `Second Concentration Desc` <chr>,
    #> #   Degree <chr>, `Degree Desc` <chr>, Level <chr>, Term <chr>, College <chr>,
    #> #   `College Desc` <chr>, PHGRAD.Degree <chr>, MatriculationTerm <chr>, …
    

    Created on 2023-05-23 with reprex v2.0.2