r, dataframe, dplyr, feature-engineering

How can I generalize a categorical variable with many levels in R?


I have the following df in R:

ID      GENDER        COUNTRY
1         M             US
2         M             UK
3         F             JPN
4         F             NED

There are over 50 different countries, and I want to summarize this information as follows: if the person is from one of the 10 most popular countries (the countries with the most records), COUNTRY_POPULAR should be 1, otherwise 0. For example, suppose US and UK are among the 10 most frequent countries in this df and JPN and NED are not:

ID      GENDER        COUNTRY         COUNTRY_POPULAR 
1         M             US                   1
2         M             UK                   1
3         F             JPN                  0
4         F             NED                  0

Solution

  • In base R, we can use table to count the occurrences of each country, sort the counts in ascending order, select the top 10 countries with tail, and assign 1/0 values based on whether each row's country is among them.

    df$COUNTRY_POPULAR <- +(df$COUNTRY %in% names(tail(sort(table(df$COUNTRY)), 10)))
    

    The leading unary + converts the logical values TRUE/FALSE to 1/0, respectively.
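
To see the approach end to end, here is a minimal sketch on a small made-up data frame. The data values, the n_top variable, and the top_countries name are assumptions for illustration only. The cutoff is 2 rather than 10 so the result is easy to check by eye, and as.integer() is used as a more explicit alternative to the unary +.

```r
# Toy data: hypothetical values, not the asker's real data
df <- data.frame(
  ID = 1:8,
  GENDER = c("M", "M", "F", "F", "M", "F", "M", "F"),
  COUNTRY = c("US", "US", "US", "UK", "UK", "JPN", "NED", "UK"),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: the question uses 10, 2 keeps the toy example readable
n_top <- 2

# Count records per country, sort ascending, keep the names of the n_top largest
top_countries <- names(tail(sort(table(df$COUNTRY)), n_top))

# %in% returns TRUE/FALSE; as.integer() turns that into 1/0
df$COUNTRY_POPULAR <- as.integer(df$COUNTRY %in% top_countries)

print(df)
```

Note that if several countries tie exactly at the cutoff, which of them ends up in the top set depends on how sort orders the tied counts, so it may be worth checking the counts around the boundary on real data.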