
R - delete length-one strings and stopwords (using tidytext) in a character column


If I have a df:

   Class sentence
1   Yes  there is p beaker on the table
2   Yes  they t the frown
3   Yes  so Z it was asleep


How do I remove the length-one strings in the "sentence" column (things like "t", "p" and "Z"), and then do a final clean using the stop_words list in tidytext to get the result below?

   Class sentence
1   Yes  beaker table
2   Yes  frown
3   Yes  asleep

Solution

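  • With tidytext: add a row-id column (row_number()) so each sentence can be reassembled later, split the sentence column into one word per row with unnest_tokens() (which also lowercases the text by default), anti_join() against the stopword list returned by get_stopwords() (the Snowball English list by default), filter out the words that are only one character long, then group by the row id and paste the remaining values of 'word' back together to recreate 'sentence'.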

    library(dplyr)
    library(tidytext)
    library(stringr)
    df %>% 
       mutate(rn = row_number()) %>%
       unnest_tokens(word, sentence) %>% 
       anti_join(get_stopwords(), by = "word") %>% 
       filter(nchar(word) > 1) %>%
       group_by(rn, Class) %>%
       summarise(sentence = str_c(word, collapse = ' '), .groups = 'drop') %>% 
       select(-rn)
    

    Output

    # A tibble: 3 x 2
      Class sentence    
      <chr> <chr>       
    1 Yes   beaker table
    2 Yes   frown       
    3 Yes   asleep      
    

    Data

    df <- structure(list(Class = c("Yes", "Yes", "Yes"), sentence = c("there is p beaker on the table", 
    "they t the frown", "so Z it was asleep")), 
    class = "data.frame", row.names = c("1", 
    "2", "3"))
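
    For comparison, the same two filters (drop one-character words, drop stopwords) can be sketched in base R without the unnest/re-group round trip. This is only an illustrative alternative, not the approach from the answer above: clean_sentence is a hypothetical helper, and stops is a small hand-written vector assumed to be sufficient for this example data; the word column of get_stopwords() or of tidytext's stop_words could be passed instead.

```r
# Same data as in the question.
df <- data.frame(
  Class = c("Yes", "Yes", "Yes"),
  sentence = c("there is p beaker on the table",
               "they t the frown",
               "so Z it was asleep"),
  stringsAsFactors = FALSE
)

# Assumption: a hand-written stopword vector, just enough for this data.
# tidytext::get_stopwords()$word could be swapped in here.
stops <- c("there", "is", "on", "the", "they", "so", "it", "was")

# Hypothetical helper (not from any package): lowercases each sentence,
# splits it on whitespace, keeps words longer than one character that
# are not stopwords, and pastes the survivors back together.
clean_sentence <- function(x, stops) {
  words <- strsplit(tolower(x), "\\s+")
  vapply(words, function(w) {
    keep <- nchar(w) > 1 & !(w %in% stops)
    paste(w[keep], collapse = " ")
  }, character(1))
}

df$sentence <- clean_sentence(df$sentence, stops)
df
```

    Unlike unnest_tokens(), this sketch splits on whitespace only and does not strip punctuation, so a word such as "table," would keep its comma and would not match a stopword entry.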