
R: Recoding several characters into one new factor


I am new to R, and could not find specific help for my question on this site.

I have (among others) ten character variables in my data frame grant_database, country_1 through country_10. Each contains either a country code, for example E20, F27 or G10, or an NA. Each case is a grant to a project, and the ten country variables specify which country or countries the grant benefits. Most, but not all, cases have at least one country code, always starting in country_1; many have one in country_2 as well, and some even in country_3 to country_10. All empty fields are marked with an NA.

id  country_1  country_2  country_3  country_4  country_5  country_6 ...new_binaryvar
1   F20        NA         NA         NA         NA         NA           0        
2   E12        E17        E52        NA         NA         NA           0
3   O62        O33        NA         NA         NA         NA           0
4   E21        E20        NA         NA         NA         NA           1
5   NA         NA         NA         NA         NA         NA           0
...

I wish to create a new factor flagging grants which benefit a defined subset of countries. This binary "dummy" variable should give the value "1" to each case where at least one of the ten country variables contains a code from a given list of country codes. It should give "0" to each case/grant that has none of the listed codes in any of its ten country variables. Let the subset of country codes to be flagged be E20, F27 and G10 (in reality, about 40 of the 150+ codes are to be flagged).

Would you help me out by suggesting a way to program this? Thank you very much for your help!


Solution

  • I assume you want to check whether any of a set of country codes appears in the "country" variables of each row: a row gets "1" if at least one of its country variables holds one of the codes, and "0" otherwise. First, create a vector (v1) of the country codes to check. Convert the dataset (df) to a matrix after dropping the "id" column (as.matrix(df[,-1])) and compare it with v1 using %in%, which returns a plain logical vector without dimensions. Turn that vector back into a matrix by assigning it the dimensions of df[,-1] with dim<-, here c(5,7), that is, 5 rows and 7 country columns. Then take the rowSums, double negate (!!) so that any nonzero count becomes TRUE, and finally add 0 to convert the logical result into the 0/1 dummy variable.

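    # country codes to flag
    v1 <- c('E20', 'F27', 'G10')
    # match every cell against v1, restore the 5 x 7 shape, flag rows with any hit
    (!!rowSums(`dim<-`(as.matrix(df[,-1]) %in% v1, c(5,7))))+0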
    #[1] 0 0 0 1 0
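
The one-liner above hard-codes the matrix dimensions as c(5,7). A minimal sketch of the same steps, written out one at a time and taking the dimensions from the data instead, could look like this. The small three-column data frame is only an illustrative stand-in for the real data, and the names m, hits and flag are made up for this example.

```r
# Example data shaped like the question's table (id plus country columns)
df <- data.frame(
  id = 1:5,
  country_1 = c("F20", "E12", "O62", "E21", NA),
  country_2 = c(NA, "E17", "O33", "E20", NA),
  country_3 = c(NA, "E52", NA, NA, NA),
  stringsAsFactors = FALSE
)
v1 <- c("E20", "F27", "G10")

m <- as.matrix(df[, -1])         # character matrix of the country columns
hits <- m %in% v1                # logical vector; %in% drops the dimensions
dim(hits) <- dim(m)              # restore rows x columns without hard-coding them
flag <- (rowSums(hits) > 0) + 0  # 1 if any column matched, otherwise 0
flag
#[1] 0 0 0 1 0
```

Because NA %in% v1 is FALSE, empty fields never count as a match, so no separate NA handling is needed here.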
    

    data

    df <- structure(list(id = 1:5, country_1 = c("F20", "E12", "O62", "E21", 
    NA), country_2 = c(NA, "E17", "O33", "E20", NA), country_3 = c(NA, 
    "E52", NA, NA, NA), country_4 = c(NA, NA, NA, NA, NA), country_5 = c(NA, 
    NA, NA, NA, NA), country_6 = c(NA, NA, NA, NA, NA), country_7 = c(NA, 
    NA, NA, NA, NA)), .Names = c("id", "country_1", "country_2", 
    "country_3", "country_4", "country_5", "country_6", "country_7"
    ), class = "data.frame", row.names = c(NA, -5L))
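
Since the question asks for a factor and the real data has ten country columns among other variables, it may be easier to pick the columns by name rather than by dropping "id". The sketch below is one possible way to do that, not the method used in the answer above: flag_countries is a hypothetical helper, the "^country_" pattern assumes the column naming from the question, and grants is a made-up three-row example.

```r
# Hypothetical helper: flag rows where any matching column holds one of `codes`
flag_countries <- function(data, codes, pattern = "^country_") {
  # Select the country columns by name, so other variables are ignored
  cols <- grep(pattern, names(data), value = TRUE)
  # One logical vector per column; NA %in% codes is FALSE, so NAs never match
  per_col <- lapply(data[cols], `%in%`, codes)
  # Combine the columns element-wise with OR: TRUE if any column matched
  any_hit <- Reduce(`|`, per_col)
  factor(as.integer(any_hit), levels = c(0, 1))
}

grants <- data.frame(
  id = 1:3,
  country_1 = c("F20", "E21", NA),
  country_2 = c(NA, "E20", NA),
  stringsAsFactors = FALSE
)
grants$new_binaryvar <- flag_countries(grants, c("E20", "F27", "G10"))
```

Fixing the levels to c(0, 1) keeps both levels in the factor even when a particular subset of the data happens to contain only flagged or only unflagged grants.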