Tags: r, if-statement, for-loop, nested-loops, na

How to use ifelse inside the for loop in R


I need to check the name of every variable in a data.frame. If the name matches, the NA values in that variable should be replaced with the median; for all other variables, the NAs should be replaced with the mean.

The data.frame cyl_spec has 11 variables, and I have to replace the NAs as follows:

  1. Viscosity: Impute with median
  2. Wax: Impute with median
  3. Others: Impute with Mean

I can certainly do it by picking the variables one at a time, but I tried the following code:

attach(cyl_spec)
var <- colnames(cyl_spec)
for(val in var)
{
  if(val == 'viscosity'){viscosity[is.na(viscosity == T)] <- median(viscosity, na.rm = T)}
  else if(val == 'wax'){wax[is.na(wax == T)] <- median(wax, na.rm = T)}
  else {val[is.na(val == T)] <- mean(val, na.rm = T)}
}
detach(cyl_spec)

Somehow the code is not doing anything, and I still get the same number of NAs in the variable when I run this command:

sum(is.na(cyl_spec$viscosity))

Also, when I run this code I get the following warning messages:

Warning messages:
1: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
2: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
3: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
4: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
5: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
6: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
7: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
8: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA
9: In mean.default(val, na.rm = T) :
  argument is not numeric or logical: returning NA

Could someone please help me find a solution for this? I am stuck! Thanks in advance!


Solution

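  • You do not need a loop to do this. The usual test for missing values is is.na(var); is.na(var == TRUE) returns the same result, but the comparison is redundant. The loop has two real problems. First, after attach(cyl_spec), an assignment such as viscosity[...] <- ... modifies a copy of viscosity in your workspace, not the column inside cyl_spec, which is why the NA count in cyl_spec$viscosity does not change. Second, in the else branch val is the column name (a character string), not the column itself, so mean(val) warns that the argument is not numeric or logical and returns NA. If you want to avoid typing the name of your data frame, use a function that evaluates in its context (like with or the dplyr functions) rather than attach.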
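
    The behaviour of attach() and of the loop variable can be reproduced on a small example. This is only a sketch: toy is a made-up stand-in for cyl_spec, and its values are invented for illustration.

    ```r
    # Hypothetical stand-in for cyl_spec (invented values)
    toy <- data.frame(viscosity = c(1, NA, 3), wax = c(NA, 2, 4))

    attach(toy)
    # This creates and fills a NEW object `viscosity` in the workspace;
    # the column toy$viscosity is left untouched.
    viscosity[is.na(viscosity)] <- median(viscosity, na.rm = TRUE)
    detach(toy)

    sum(is.na(toy$viscosity))  # the data frame still has its NA
    sum(is.na(viscosity))      # only the workspace copy was imputed

    # Inside `for (val in colnames(toy))`, val is just a string:
    val <- "wax"
    class(val)                 # "character"
    mean(val, na.rm = TRUE)    # NA, with the "not numeric or logical" warning
    ```

    Indexing the data frame directly, as in the lines that follow, avoids both problems.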

    cyl_spec$viscosity[is.na(cyl_spec$viscosity)] <- median(cyl_spec$viscosity, na.rm = TRUE)
    cyl_spec$wax[is.na(cyl_spec$wax)] <- median(cyl_spec$wax, na.rm = TRUE)
    # val stands in for each remaining column name; repeat this line per column
    cyl_spec$val[is.na(cyl_spec$val)] <- mean(cyl_spec$val, na.rm = TRUE)
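
    If you do want to keep the loop from the question, one possible repair is to index the data frame by name with [[ ]] instead of relying on attach(). The sketch below runs on an invented three-column stand-in for cyl_spec; the hardness column and all values are hypothetical, and median_vars is a helper name introduced here.

    ```r
    # Hypothetical stand-in for cyl_spec; the real data has 11 columns
    cyl_spec <- data.frame(
      viscosity = c(10, NA, 30, 100),
      wax       = c(1, 2, NA, 9),
      hardness  = c(4, NA, 8, 6)   # invented "other" column
    )

    # Columns to impute with the median; everything else gets the mean
    median_vars <- c("viscosity", "wax")

    for (val in colnames(cyl_spec)) {
      x <- cyl_spec[[val]]          # the column itself, not its name
      if (!is.numeric(x)) next      # skip non-numeric columns
      fill <- if (val %in% median_vars) {
        median(x, na.rm = TRUE)
      } else {
        mean(x, na.rm = TRUE)
      }
      cyl_spec[[val]][is.na(x)] <- fill   # write back into the data frame
    }

    colSums(is.na(cyl_spec))  # NA count per column after imputation
    ```

    Because the assignment targets cyl_spec[[val]], the change lands in the data frame itself rather than in a workspace copy.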
    

    If all you need is to handle this one data.frame and those few variables, I strongly recommend you stick to this base R solution. If, however, you want to do this on a data frame with more variables and automate it, you could look into dplyr::mutate_each (deprecated in current dplyr releases, where across() covers the same use). Here is an example with simulated data.

    We create a data.frame with 7 variables and set every value above 0.8 to NA.

    library(dplyr)
    
    set.seed(10)
    df <- data.frame(n=runif(100),
                     m=runif(100),
                     d=runif(100),
                     o=runif(100),
                     e=runif(100),
                     f=runif(100),
                     g=runif(100))
    
    df <- mutate_each(df,funs(ifelse(.>.8,NA,.)))
    
    head(df)
    
               n          m         d           o         e          f         g
    1 0.50747820 0.34434350 0.2230884 0.347860110        NA         NA        NA
    2 0.30676851 0.06132255 0.5358950 0.007992606 0.6855115         NA 0.7478783
    3 0.42690767 0.36897981 0.6625291 0.401344915 0.6296311         NA 0.7225419
    4 0.69310208 0.40759356        NA 0.588350693 0.7508252 0.29063776 0.5457709
    5 0.08513597         NA 0.1491831          NA        NA 0.07203601 0.2641231
    6 0.22543662         NA 0.6700994 0.708542599 0.3600703 0.55888842 0.3057243
    

    Now, we apply to each variable a function that imputes its NA values with either the mean or the median:

    df <- df %>%
    ## Which variables are to be recoded with mean? here, n and m
      mutate_each(funs(ifelse(is.na(.),mean(.,na.rm = TRUE),.)),n,m) %>% 
    ## Which variables are to be recoded with median? here, d,o,e,f,g
      mutate_each(funs(ifelse(is.na(.),median(.,na.rm = TRUE),.)),d,o,e,f,g)
    
    head(df)
    
               n          m         d           o         e          f         g
    1 0.50747820 0.34434350 0.2230884 0.347860110 0.3602354 0.39956699 0.4499041
    2 0.30676851 0.06132255 0.5358950 0.007992606 0.6855115 0.39956699 0.7478783
    3 0.42690767 0.36897981 0.6625291 0.401344915 0.6296311 0.39956699 0.7225419
    4 0.69310208 0.40759356 0.4407363 0.588350693 0.7508252 0.29063776 0.5457709
    5 0.08513597 0.40892568 0.1491831 0.378731867 0.3602354 0.07203601 0.2641231
    6 0.22543662 0.40892568 0.6700994 0.708542599 0.3600703 0.55888842 0.3057243
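
    In current dplyr releases mutate_each() and funs() are deprecated. Assuming dplyr 1.0.0 or later, the same two steps can be sketched with mutate() and across(); mean_vars is a helper name introduced here, and the split of columns mirrors the example above.

    ```r
    library(dplyr)

    set.seed(10)
    df <- data.frame(n = runif(100), m = runif(100), d = runif(100),
                     o = runif(100), e = runif(100), f = runif(100),
                     g = runif(100))

    # Set every value above 0.8 to NA, in all columns
    df <- mutate(df, across(everything(), ~ ifelse(.x > 0.8, NA, .x)))

    # Columns imputed with the mean; all remaining columns use the median
    mean_vars <- c("n", "m")

    df <- df %>%
      mutate(
        across(all_of(mean_vars),
               ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)),
        across(!all_of(mean_vars),
               ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
      )

    anyNA(df)  # checks whether any NA is left
    ```

    Listing only the mean columns and negating the selection for the rest keeps the call short when most variables share the same rule.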