Search code examples
rdata-manipulationdata-cleaningdummy-variable

Create new Categorical variable based on a subset of data


I have a dataframe that looks like this:

         cnt    bnk qst ans
1  Country 1 Bank 1  q1   1
2  Country 2 Bank 2  q1   1
3  Country 3 Bank 3  q1   3
4  Country 4 Bank 4  q1   1
5  Country 1 Bank 1  q2   1
6  Country 2 Bank 2  q2   2
7  Country 3 Bank 3  q2   3
8  Country 4 Bank 4  q2   4
9  Country 1 Bank 1  q3   1
10 Country 2 Bank 2  q3   1
11 Country 3 Bank 3  q3   2
12 Country 4 Bank 4  q3   1

For your information, q's stand for "Question". So, q2 is "Question 2". Similarly, ans are the responses.

Now, I want to create a categorical variable based on the responses in q2. In particular I want to assign the following categories:

  1. Public
  2. Private
  3. mixed
  4. Other

So, if the ans=1 to qst=q2, this is "Public", if ans=2 to qst=q2 this is "Private" etc. So, my dataframe after this should look like this:

         cnt    bnk qst ans   dummy
1  Country 1 Bank 1  q1   1  Public
2  Country 2 Bank 2  q1   1 Private
3  Country 3 Bank 3  q1   3   Mixed
4  Country 4 Bank 4  q1   1  Other'
5  Country 1 Bank 1  q2   1  Public
6  Country 2 Bank 2  q2   2 Private
7  Country 3 Bank 3  q2   3   Mixed
8  Country 4 Bank 4  q2   4  Other'
9  Country 1 Bank 1  q3   1  Public
10 Country 2 Bank 2  q3   1 Private
11 Country 3 Bank 3  q3   2   Mixed
12 Country 4 Bank 4  q3   1  Other'

I tried to use the ifelse, but I failed to do what I want. Can someone give me some suggestions on how I can do it ?

DATA

dput(df)
structure(list(cnt = c("Country 1", "Country 2", "Country 3", 
"Country 4", "Country 1", "Country 2", "Country 3", "Country 4", 
"Country 1", "Country 2", "Country 3", "Country 4"), bnk = c("Bank 1", 
"Bank 2", "Bank 3", "Bank 4", "Bank 1", "Bank 2", "Bank 3", "Bank 4", 
"Bank 1", "Bank 2", "Bank 3", "Bank 4"), qst = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("q1", 
"q2", "q3"), class = "factor"), ans = c(1L, 1L, 3L, 1L, 1L, 2L, 
3L, 4L, 1L, 1L, 2L, 1L), dummy = c(NA, NA, NA, NA, "Public", 
"Private", "Mixed", "Other", NA, NA, NA, NA)), .Names = c("cnt", 
"bnk", "qst", "ans", "dummy"), row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")

Solution

  • The follwing will put NA for all other Qs,

    df$dummy <- ifelse(df$ans == 1 & df$qst == 'q2', 'Public', 
                   ifelse(df$ans == 2 & df$qst == 'q2', 'Private', 
                       ifelse(df$ans == 3 & df$qst == 'q2', 'Mixed', 
                            ifelse(df$ans == 4 & df$qst == 'q2', 'Other', NA))))
    
    #         cnt    bnk qst ans   dummy
    #1  Country 1 Bank 1  q1   1    <NA>
    #2  Country 2 Bank 2  q1   1    <NA>
    #3  Country 3 Bank 3  q1   3    <NA>
    #4  Country 4 Bank 4  q1   1    <NA>
    #5  Country 1 Bank 1  q2   1  Public
    #6  Country 2 Bank 2  q2   2 Private
    #7  Country 3 Bank 3  q2   3   Mixed
    #8  Country 4 Bank 4  q2   4   Other
    #9  Country 1 Bank 1  q3   1    <NA>
    #10 Country 2 Bank 2  q3   1    <NA>
    #11 Country 3 Bank 3  q3   2    <NA>
    #12 Country 4 Bank 4  q3   1    <NA>