Search code examples
rmappingconcatenationpurrrmissing-data

Why does map_df produce many missing values? How can i concatenate across rows to removing NAs?


I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want a column for each subject and the possible grade (math_1, science_2, etc.).

I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs and only one chunk per column has actual data. I'm curious to know either:

  1. Why is map_df() doing this and how can I avoid it? OR
  2. How can I tighten this up so I only have this information on one row per the original rows in the first dataset (18 rows)? In other words, I'd concatenate up and down the column, so all the NAs are filled in (unless there truly was missing data).

Here's my code

library(tidyverse)

#Set up - generate sample dataset and get all combinations of grades and subjects

student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
                         subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
                         grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))

all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3", 
                "4", "5")

subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)

all_combos <- subjects_and_letter_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names = str_replace_all(names, "\\|", "_")) %>%
  pull(names)


# iterate over each combination using map_df()
student_map <- map_df(all_combos,
                        ~student_grades %>%
                          mutate("{.x}" := paste(i)) %>%
                          group_by(student_id) %>%
                          mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
                                                           str_detect(.x, grade) ~ 1,
                                                         TRUE ~ 0), na.rm = T)))

EDIT For the record, my almost identical for loop does not pad in many missing values. I assume it must have something to do with how it is building the dataset, but I don't know how I can override what map_df is doing under the hood.

student_map <- student_grades
for(i in all_combos) {
  student_map <- student_map %>%
    mutate("{i}" := paste(i)) %>%
    group_by(student_id) %>%
    mutate("{i}" := sum(case_when(str_detect(i, subject) &
                                    str_detect(i, grade) ~ 1,
                                  TRUE ~ 0), na.rm = T)) 
}

Solution

  • There is no i in the map as the default lambda value looped is .x. Also, it is better to use transmute instead of mutate as we need to return only the columns added in each iteration and then we bind with the original data at the end

    library(dplyr)
    library(purrr)
    library(stringr)
    student_map2 <- map_dfc(all_combos,
      ~ student_grades %>% 
      transmute(subject, grade, student_id, "{.x}" := .x) %>% 
      group_by(student_id) %>%  
      transmute("{.x}" := sum(case_when(str_detect( .x, subject) & 
          str_detect(.x, grade)~ 1, TRUE ~ 0), na.rm = TRUE)) %>%
      ungroup %>% 
      select(-student_id)) %>% 
       bind_cols(student_grades, .)
    

    -checking with OP's for loop output

    > all.equal(student_map, student_map2, check.attributes = FALSE)
    [1] TRUE