I have a dataframe with Twitter bios formatted like the table below.
account | bio |
---|---|
38374 | i love candy as much as life itself proud liberal |
45673 | can all just get along |
94928 | conserv christian mom and proud pro trump veteran maga |
11204 | professor of women and gender studies at wesleyan university blacklivesmatter |
37465 | former ohio state football coach now a proud papa to seven grandchildren |
A number of answers on Stack Overflow explain how to remove a specified list of words from a dataframe column (like "R - remove word from a sentence" and "How to remove words of a sentence by using a dictionary as reference"). But I want to remove ALL words in the bio column UNLESS they are found in a predetermined list of words. The list of words to keep is made up of 1052 words (as seen below):
> termstokeep
[1] love life follow live just like music regist trademark
[10] make fan one copyright lover thing world time god
[19] can get design peopl artist girl univers writer will
[28] student work busi good new know friend famili best
[37] day account market sport art game manag want book
[46] enthusiast person alway travel never free real help dream
[55] servic mom husband profession beauti offici wife now news
[64] social food come father heart educ develop need anim
[73] everyth proud tri year happi also media way man
[82] team produc look state take back support director home
[91] find call engin learn provid photograph great author video
[100] guy communiti coach name big passion see teacher school
[109] product sinc gamer enjoy keep player better let believ
[118] mother think mind dog futur give colleg say owner
[127] jesus fun got littl chang founder boy use first
[136] liberal write footbal kid fuck event polit consult care
[145] conserv much health technolog tech opinion stay everi right
[154] full former member special well young high creat snap
[163] entrepreneur movi feel view compani coffe cat citi human
[172] digit show singer sometim interest dad watch scienc creativ
[181] blogger base addict fit read bless fashion part noth
[190] run forev editor born hard die around onlin nerd
[199] class web musician made stuff leader ever inspir still
[208] christian place current public danc pleas geek talk film
[217] realli babi someth page rock lot women lead two
Ideally, after all non-specified words are removed, the dataframe would look something like this:
account | bio |
---|---|
38374 | love life proud liberal |
45673 | |
94928 | conserv christian mom proud pro trump veteran maga |
11204 | professor women gender university blacklivesmatter |
37465 | ohio state football coach proud grandchildren |
How can I accomplish this?
Here is another base R option:
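# Split each bio on whitespace, keep only the words that also appear in
# termstokeep, then paste the surviving words back into one string per row.
df$bio <- sapply(lapply(strsplit(df$bio, "\\s"), intersect, termstokeep),
                 paste, collapse = " ")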
Output
account bio
1 38374 love much life proud liberal
2 45673 can just get
3 94928 conserv christian mom proud
4 11204 women
5 37465 former state coach now proud
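One thing to be aware of: `intersect()` returns unique values, so a word that appears twice in a bio is kept only once. If repeats matter (for word counts, say), a minimal sketch using `%in%` keeps every matching token in its original order. The helper name `keep_terms` and the toy vectors `bios` and `keep` are made up for illustration and are not part of the question's data.

```r
# Toy data, invented for illustration (not the asker's data frame)
bios <- c("proud mom and proud veteran", "can all just get along", "")
keep <- c("proud", "mom", "can", "just", "get")

# Hypothetical helper: keep every token found in `keep`,
# preserving both word order and repeated words
keep_terms <- function(x, keep) {
  tokens <- strsplit(x, "\\s+")
  vapply(tokens,
         function(w) paste(w[w %in% keep], collapse = " "),
         character(1))
}

filtered <- keep_terms(bios, keep)

# Same data through the intersect() approach, for comparison
with_intersect <- sapply(lapply(strsplit(bios, "\\s+"), intersect, keep),
                         paste, collapse = " ")

print(filtered)        # first element keeps "proud" twice
print(with_intersect)  # first element keeps "proud" once
```

With the data from the question, this would be called as `df$bio <- keep_terms(df$bio, termstokeep)`. Using `vapply()` with `character(1)` also guarantees a character vector back, even if a bio is empty.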
Data (thanks @RuiBarradas!)
df <- structure(list(account = c(38374L, 45673L, 94928L, 11204L, 37465L
), bio = c("i love candy as much as life itself proud liberal",
"can all just get along", "conserv christian mom and proud pro trump veteran maga",
"professor of women and gender studies at wesleyan university blacklivesmatter",
"former ohio state football coach now a proud papa to seven grandchildren"
)), class = "data.frame", row.names = c(NA, -5L))
termstokeep <- c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")