Search code examples
rcsvdplyrdata-manipulationdata-cleaning

na.string in read.csv and ifelse function in dplyr::mutate


I used 2 ways to import a data (in CSV format) into R. The first method does not have the na.string argument, while the second one has. I used the second one because some of the strings appeared as "" instead of NA after the import, and that I want to standardise all missing values as NA.

data1<-read.csv("file.csv",stringsAsFactors=FALSE)
data2<-read.csv("file",stringsAsFactors=FALSE,na.string="")

I have 3 variables that are indicators. They give "X" for yes and ""/NA for no. I tried to apply the following function to both data1 and data2 above.

df1<-data1%>%
     mutate(Indicator_Institution=ifelse(Indicator_A=="X",1,
                                  ifelse(Indicator_B=="X",2,
                                  ifelse(Indicator_C=="X",3,NA))))
df2<-data2%>%
     mutate(Indicator_Institution=ifelse(Indicator_A=="X",1,
                                  ifelse(Indicator_B=="X",2,
                                  ifelse(Indicator_C=="X",3,NA))))

df1's ifelse function went through all the conditions, while df2 only run the first condition. Any idea why? What difference does the argument na.string="" make?

Reproducible example:

    > dput(droplevels(head(data1)))
structure(list(Indicator_A = c("X", "X", "X", "X", "", ""), 
    Indicator_B = c("", "", "", "", "X", "X"), Indicator_C = c("", 
    "", "", "", "", "")), .Names = c("Indicator_A", "Indicator_B", 
"Indicator_C"), row.names = c(NA, 6L), class = "data.frame")

> dput(droplevels(head(data2)))
structure(list(Indicator_A = c("X", "X", "X", "X", NA, NA), 
    Indicator_B = c(NA, NA, NA, NA, "X", "X"), Indicator_C = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), .Names = c("Indicator_A", "Indicator_B", 
"Indicator_C"), row.names = c(NA, 6L), class = "data.frame")

Solution

  • The reason is that for the second case instead of blanks, we have NA. If we use ==, the NA values will remain as such. To make those values as FALSE, use the & with !is.na

    data2 %>% 
        mutate(Indicator_Institution = ifelse(Indicator_A == "X" & !is.na(Indicator_A), 1, 
                                       ifelse(Indicator_B=="X" & !is.na(Indicator_B), 2,
                                       ifelse(Indicator_C == "X" & !is.na(Indicator_C), 3, 
                      NA))))
    

    Based on the example provided, this can be done easily with which

    which(!is.na(data2), arr.ind=TRUE)[,2]