How to remove all words not in a specified list from a dataframe column in R


I have a dataframe with Twitter bios formatted like the table below.

account bio
38374 i love candy as much as life itself proud liberal
45673 can all just get along
94928 conserv christian mom and proud pro trump veteran maga
11204 professor of women and gender studies at wesleyan university blacklivesmatter
37465 former ohio state football coach now a proud papa to seven grandchildren

A number of questions on Stack Overflow ask how to remove a specified list of words from a dataframe column (such as "R - remove word from a sentence" and "How to remove words of a sentence by using a dictionary as reference"). But I want to remove ALL words in the bio column UNLESS they are found in a predetermined list of words. The list of words to keep contains 1052 words (partly shown below).

> termstokeep
   [1] love         life         follow       live         just         like         music        regist       trademark   
  [10] make         fan          one          copyright    lover        thing        world        time         god         
  [19] can          get          design       peopl        artist       girl         univers      writer       will        
  [28] student      work         busi         good         new          know         friend       famili       best        
  [37] day          account      market       sport        art          game         manag        want         book        
  [46] enthusiast   person       alway        travel       never        free         real         help         dream       
  [55] servic       mom          husband      profession   beauti       offici       wife         now          news        
  [64] social       food         come         father       heart        educ         develop      need         anim        
  [73] everyth      proud        tri          year         happi        also         media        way          man         
  [82] team         produc       look         state        take         back         support      director     home        
  [91] find         call         engin        learn        provid       photograph   great        author       video       
 [100] guy          communiti    coach        name         big          passion      see          teacher      school      
 [109] product      sinc         gamer        enjoy        keep         player       better       let          believ      
 [118] mother       think        mind         dog          futur        give         colleg       say          owner       
 [127] jesus        fun          got          littl        chang        founder      boy          use          first       
 [136] liberal      write        footbal      kid          fuck         event        polit        consult      care        
 [145] conserv      much         health       technolog    tech         opinion      stay         everi        right       
 [154] full         former       member       special      well         young        high         creat        snap        
 [163] entrepreneur movi         feel         view         compani      coffe        cat          citi         human       
 [172] digit        show         singer       sometim      interest     dad          watch        scienc       creativ     
 [181] blogger      base         addict       fit          read         bless        fashion      part         noth        
 [190] run          forev        editor       born         hard         die          around       onlin        nerd        
 [199] class        web          musician     made         stuff        leader       ever         inspir       still       
 [208] christian    place        current      public       danc         pleas        geek         talk         film        
 [217] realli       babi         someth       page         rock         lot          women        lead         two    

Ideally, after all non-specified words are removed, the dataframe would look something like this:

account bio
38374 love life proud liberal
45673
94928 conserv christian mom proud pro trump veteran maga
11204 professor women gender university blacklivesmatter
37465 ohio state football coach proud grandchildren

How can I accomplish this?


Solution

  • Here is another base R option:

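    # Split each bio on whitespace, keep only the words found in termstokeep,
    # then paste the survivors back together (intersect() also drops repeats).
    df$bio <- sapply(lapply(strsplit(df$bio, "\\s"), intersect, termstokeep),
           paste, collapse = " ")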
    

    Output

      account                          bio
    1   38374 love much life proud liberal
    2   45673                 can just get
    3   94928  conserv christian mom proud
    4   11204                        women
    5   37465 former state coach now proud
    

    Data (thanks @RuiBarradas!)

    df <- structure(list(account = c(38374L, 45673L, 94928L, 11204L, 37465L
    ), bio = c("i love candy as much as life itself proud liberal", 
    "can all just get along", "conserv christian mom and proud pro trump veteran maga", 
    "professor of women and gender studies at wesleyan university blacklivesmatter", 
    "former ohio state football coach now a proud papa to seven grandchildren"
    )), class = "data.frame", row.names = c(NA, -5L))
    
    termstokeep <- c("love", "life", "follow", "live", "just", "like", "music", 
    "regist", "trademark", "make", "fan", "one", "copyright", "lover", 
    "thing", "world", "time", "god", "can", "get", "design", "peopl", 
    "artist", "girl", "univers", "writer", "will", "student", "work", 
    "busi", "good", "new", "know", "friend", "famili", "best", "day", 
    "account", "market", "sport", "art", "game", "manag", "want", 
    "book", "enthusiast", "person", "alway", "travel", "never", "free", 
    "real", "help", "dream", "servic", "mom", "husband", "profession", 
    "beauti", "offici", "wife", "now", "news", "social", "food", 
    "come", "father", "heart", "educ", "develop", "need", "anim", 
    "everyth", "proud", "tri", "year", "happi", "also", "media", 
    "way", "man", "team", "produc", "look", "state", "take", "back", 
    "support", "director", "home", "find", "call", "engin", "learn", 
    "provid", "photograph", "great", "author", "video", "guy", "communiti", 
    "coach", "name", "big", "passion", "see", "teacher", "school", 
    "product", "sinc", "gamer", "enjoy", "keep", "player", "better", 
    "let", "believ", "mother", "think", "mind", "dog", "futur", "give", 
    "colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang", 
    "founder", "boy", "use", "first", "liberal", "write", "footbal", 
    "kid", "fuck", "event", "polit", "consult", "care", "conserv", 
    "much", "health", "technolog", "tech", "opinion", "stay", "everi", 
    "right", "full", "former", "member", "special", "well", "young", 
    "high", "creat", "snap", "entrepreneur", "movi", "feel", "view", 
    "compani", "coffe", "cat", "citi", "human", "digit", "show", 
    "singer", "sometim", "interest", "dad", "watch", "scienc", "creativ", 
    "blogger", "base", "addict", "fit", "read", "bless", "fashion", 
    "part", "noth", "run", "forev", "editor", "born", "hard", "die", 
    "around", "onlin", "nerd", "class", "web", "musician", "made", 
    "stuff", "leader", "ever", "inspir", "still", "christian", "place", 
    "current", "public", "danc", "pleas", "geek", "talk", "film", 
    "realli", "babi", "someth", "page", "rock", "lot", "women", "lead", 
    "two")
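
    One thing to be aware of with the intersect() approach above: intersect() returns unique values, so a word that appears twice in a bio is kept only once. If repeated words matter (for example, for later word counts), a variant using %in% may be closer to what is wanted. The sketch below is only an illustration under that assumption; keep_terms, keep and bios are hypothetical names that do not appear in the original answer, and keep stands in for termstokeep.

```r
# Hypothetical helper (not part of the original answer): keep only the
# words found in `keep`, preserving their order and any repeated words.
keep_terms <- function(text, keep) {
  # Split each string on runs of whitespace; the result is a list of vectors.
  tokens <- strsplit(trimws(text), "\\s+")
  # Filter each token vector with %in% and glue the kept words back together.
  vapply(tokens,
         function(w) paste(w[w %in% keep], collapse = " "),
         character(1),
         USE.NAMES = FALSE)
}

# Small illustrative inputs; `keep` stands in for `termstokeep`.
keep <- c("love", "life", "proud", "liberal", "much")
bios <- c("i love candy as much as life itself proud liberal",
          "love love peace",
          "nothing here")

result <- keep_terms(bios, keep)
print(result)
```

    With the real data this would be called as df$bio <- keep_terms(df$bio, termstokeep). Bios with no matching words come back as empty strings, as in the intersect() version.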