r dplyr glm

Add new column with 0 and 1 depending on a numeric value in column x with mutate


I want to add a column that flags high costs, so that I can predict it with a glm. I use this code:

 df %>%
      mutate(high_costs = case_when(Totalcosts>=4000~"1",
                                     Totalcosts<4000~"0"
                                     ))

This apparently gives me the right values, but now I have two questions:

  1. How can I add this column actually to my df?

  2. Is it possible (using different code) to make the output numeric instead of a factor, because I will predict 0 or 1 in my glm? Or do I have to use code like

    df$y <- as.numeric(as.factor(df$high_costs))


Solution

  • Oh yes.

    1. You just need to assign the result to a new variable (or, if you wish to go full Rambo, reassign it to df, though I would strongly advise against this).

    df_1 = df %>%
          mutate(high_costs = case_when(Totalcosts>=4000~"1",
                                         Totalcosts<4000~"0"
                                         ))
    

    You could also have used ifelse(), but I do enjoy the SQL crossover of case_when().
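As a rough illustration of the ifelse() alternative mentioned above, here is a minimal sketch in base R. The data frame and its values are made up for the example; only the column name Totalcosts and the 4000 threshold come from the question. Inside a dplyr pipeline the same expression would go in mutate().

```r
# Toy data; the values are invented purely for illustration.
df <- data.frame(Totalcosts = c(1200, 4000, 8500, 3999.99))

# ifelse() returns a numeric vector when both branches are numeric,
# so no conversion step is needed afterwards.
df$high_costs <- ifelse(df$Totalcosts >= 4000, 1, 0)

# Equivalent shortcut: coerce the logical comparison to integer 0/1.
df$high_costs_int <- as.integer(df$Totalcosts >= 4000)

print(df)
```

Both columns hold 0/1 values; the first is stored as double and the second as integer, and either should be usable as a binary response.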

    2. Yes. First off, the easiest way: drop the quotes.

    df_1 = df %>%
          mutate(high_costs = case_when(Totalcosts>=4000~1,
                                         Totalcosts<4000~0
                                         ))
    

    R will recognize these as numeric values.

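    A second approach, however, would be a little daisy chaining. This is needed when the column is a factor, given what R is actually doing when it makes a character or numeric vector into a factor (https://www.guru99.com/r-factor-categorical-continuous.html#:~:text=Factor%20in%20R%20is%20a,integer%20data%20values%20as%20levels. - Note the second sentence in the highlighted portion): the values are stored as integer codes that point to the levels, so calling as.numeric() directly on a factor returns those codes rather than the original values.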

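    So, you could do it in multiple steps:

    df_1 = df %>%
          mutate(high_costs = case_when(Totalcosts>=4000~"1",
                                         Totalcosts<4000~"0"
                                         ),
                 # a no-op here, since case_when() already returns character,
                 # but required if the column were a factor
                 high_costs = as.character(high_costs),
                 high_costs = as.numeric(high_costs))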
        
    

    Or wrap it all at once, which is harder on the eye but requires less code.

    df_1 = df %>%
          mutate(high_costs = as.numeric(as.character(case_when(Totalcosts>=4000~"1",
                                         Totalcosts<4000~"0"
                                         ))))
    
    

    'df$y <- as.numeric(as.factor(df$high_costs))' will not work the way you wish: as.numeric() on a factor returns the internal integer codes (here 1 and 2), not the labels 0 and 1. Unless you actually want those codes, converting to a factor first gains you nothing, because R already stores them by merit of the column being a factor. I strongly suggest you investigate the differences between characters and factors in R to gain further understanding as to why.
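The difference between the two conversions can be checked on a small vector. This is only a sketch: the vector high_costs below is a hypothetical stand-in for the column produced by the quoted case_when() version.

```r
# Hypothetical character flags, standing in for the quoted case_when() output.
high_costs <- c("1", "0", "0", "1")

# Going straight from factor to numeric returns the level codes.
# Levels are sorted ("0", "1"), so "0" maps to 1 and "1" maps to 2.
codes <- as.numeric(as.factor(high_costs))
print(codes)   # 2 1 1 2

# Going through character first recovers the original 0/1 values.
values <- as.numeric(as.character(as.factor(high_costs)))
print(values)  # 1 0 0 1
```

If the column is already character, as.numeric(high_costs) alone is enough; the as.character() step only matters once a factor is involved.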