I have the following df in R:
ID GENDER COUNTRY
1 M US
2 M UK
3 F JPN
4 F NED
There are over 50 different countries, and I want to summarize this info as follows: if the person is from one of the top 10 most popular countries (the countries with the most records), COUNTRY_POPULAR will be 1, else 0. For example, suppose US and UK are among the 10 most frequent countries in this df and JPN and NED are not:
ID GENDER COUNTRY COUNTRY_POPULAR
1 M US 1
2 M UK 1
3 F JPN 0
4 F NED 0
In base R, we can use table to count the occurrences of each country, sort the counts, select the top 10 countries using tail, and assign 1/0 values based on their presence/absence.
df$COUNTRY_POPULAR <- +(df$COUNTRY %in% names(tail(sort(table(df$COUNTRY)), 10)))
The leading + converts the logical values TRUE/FALSE to 1/0, respectively.
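To make the one-liner easier to follow, here is the same idea broken into steps. This is only a sketch on made-up data: the rows, the n_top variable, and the intermediate names (counts, sorted_counts, top_countries) are assumptions for illustration, and it keeps the top 2 countries rather than the top 10 so the result is easy to check by eye. It uses as.integer, which gives the same result as the unary + for a logical vector.

```r
# Toy data (hypothetical values, not the original df)
df <- data.frame(
  ID = 1:7,
  GENDER = c("M", "M", "F", "F", "M", "F", "M"),
  COUNTRY = c("US", "UK", "JPN", "US", "UK", "US", "NED"),
  stringsAsFactors = FALSE
)

n_top <- 2  # use 10 for the real data

# Step 1: count how many records each country has
counts <- table(df$COUNTRY)   # JPN 1, NED 1, UK 2, US 3

# Step 2: sort the counts in ascending order
sorted_counts <- sort(counts)

# Step 3: the last n_top entries are the most frequent countries
top_countries <- names(tail(sorted_counts, n_top))   # "UK" "US"

# Step 4: flag rows whose country is in that set (TRUE/FALSE -> 1/0)
df$COUNTRY_POPULAR <- as.integer(df$COUNTRY %in% top_countries)

print(df)
```

Note that tail keeps exactly n_top names, so if several countries are tied at the cutoff, which of them gets flagged depends on the sort order.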