Search code examples
rnapipingtidyrspread

R - tidyr - spread() - dealing with NA as column name


I am spreading multiple categorical variables to Boolean columns using tidyr::spread(). As the data contains NAs, spread creates a new column without a name.

What I'm looking for is a way to get rid off the NAs using

a) a piping solution (I've tried select_() and '['(), but don't know how to refer to the NA column's name or index) or

b) a custom function, which would be even better

c) a way to simply not generate the NA columns, Hadleyverse compatible, if possible.

Below is my current (and very inelegantly repetitive) solution.

library(tidyr)
library(dplyr)

test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop","classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = flower, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = music, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = degree, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test

Solution

  • We can use select with backquotes for the "NA" column.

     test %>% 
        mutate(truval= TRUE) %>% 
        spread(flower, truval, fill=FALSE) %>% 
        select(-`NA`)
     #  id    name     music degree petunia  rose
     #1  1    anna       pop   <NA>   FALSE  TRUE
     #2  2    bert classical    PhD   FALSE  TRUE
     #3  3 charles      rock    MSc   FALSE FALSE
     #4  4  daniel      <NA>    MSc    TRUE FALSE
    

    I guess it is difficult to not generate the NA column as the observations in other columns are tied to it. We could use filter with is.na to remove the row that has 'NA' in the 'flower' column, but then we will lose one row ie. the 3rd row.