r, dataframe, missing-data, dummy-variable

Create a dummy for missing values in a numeric variable in R


I have the following data:

  PassengerId Survived Pclass    Sex Age SibSp Parch    Fare Embarked
1           1        0      3   male  22     1     0  7.2500        S
2           2        1      1 female  38     1     0 71.2833        C
3           3        1      3 female  26     0     0  7.9250        S
4           4        1      1 female  35     1     0 53.1000        S
5           5        0      3   male  35     0     0  8.0500        S
6           6        0      3   male  NA     0     0  8.4583        Q

Now, when I use dummy or dummy.data.frame, I can successfully convert factors (here Sex and Embarked) to dummies like this:

  PassengerId Survived Pclass Sexfemale Sexmale Age SibSp Parch    Fare Embarked EmbarkedC EmbarkedQ EmbarkedS
1           1        0      3         0       1  22     1     0  7.2500        0         0         0         1
2           2        1      1         1       0  38     1     0 71.2833        0         1         0         0
3           3        1      3         1       0  26     0     0  7.9250        0         0         0         1
4           4        1      1         1       0  35     1     0 53.1000        0         0         0         1
5           5        0      3         0       1  35     0     0  8.0500        0         0         0         1
6           6        0      3         0       1  NA     0     0  8.4583        0         0         1         0

Now, how can I apply this to the Age column? When I try, it creates more than 100 dummies, one for each unique age and one for NA. I want the output to look like this:

Age   Age.NA
22    0 
38    0
......
35    0
0     1

For factors, missing values are automatically treated as a separate level and get their own dummy variable. I want the same for numeric variables, without changing the existing values in the column. Please help.


Solution

  • You can create the missing-value indicator yourself with is.na():

    df$Age.NA <- ifelse(is.na(df$Age), 1, 0)
    

    Then convert the factors as before; the numeric Age and Age.NA columns pass through unchanged:

    library(dummies)
    dummy.data.frame(df)
    

    Output:

      PassengerId Survived Pclass Sexfemale Sexmale Age SibSp Parch    Fare EmbarkedC EmbarkedQ EmbarkedS Age.NA
    1           1        0      3         0       1  22     1     0  7.2500         0         0         1      0
    2           2        1      1         1       0  38     1     0 71.2833         1         0         0      0
    3           3        1      3         1       0  26     0     0  7.9250         0         0         1      0
    4           4        1      1         1       0  35     1     0 53.1000         0         0         1      0
    5           5        0      3         0       1  35     0     0  8.0500         0         0         1      0
    6           6        0      3         0       1  NA     0     0  8.4583         0         1         0      1
    

    Data:

    df <- structure(list(PassengerId = 1:6, Survived = c(0L, 1L, 1L, 1L, 
    0L, 0L), Pclass = c(3L, 1L, 3L, 1L, 3L, 3L), Sex = structure(c(2L, 
    1L, 1L, 1L, 2L, 2L), .Label = c("female", "male"), class = "factor"), 
        Age = c(22L, 38L, 26L, 35L, 35L, NA), SibSp = c(1L, 1L, 0L, 
        1L, 0L, 0L), Parch = c(0L, 0L, 0L, 0L, 0L, 0L), Fare = c(7.25, 
        71.2833, 7.925, 53.1, 8.05, 8.4583), Embarked = structure(c(3L, 
        1L, 3L, 3L, 3L, 2L), .Label = c("C", "Q", "S"), class = "factor"), 
        Age.NA = c(0, 0, 0, 0, 0, 1)), .Names = c("PassengerId", 
    "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", 
    "Embarked", "Age.NA"), row.names = c("1", "2", "3", "4", "5", 
    "6"), class = "data.frame")
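
The answer above keeps NA in the Age column, whereas the desired output in the question shows 0 for the missing age. If that is what you need, a minimal base R sketch could look like the following. It assumes Age is stored as a numeric (here integer) column, and the small data frame is only a stand-in built from the Age values in the sample rows.

```r
# Toy data: the Age values from the six sample rows in the question.
df <- data.frame(Age = c(22L, 38L, 26L, 35L, 35L, NA))

# Flag missing ages first, before the NAs are overwritten.
# as.integer() turns TRUE/FALSE into 1/0.
df$Age.NA <- as.integer(is.na(df$Age))

# Optional: replace the missing ages with 0, as in the question's
# desired output. Skip this line to keep NA in the Age column.
df$Age[is.na(df$Age)] <- 0L

print(df)
```

The order matters: once the NAs are replaced, is.na() can no longer find them. Whether 0 is a sensible fill value depends on the model, since 0 is also a valid number and only the Age.NA column distinguishes it from a real age.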