If I have a df:
  Class sentence
1 Yes   there is p beaker on the table
2 Yes   they t the frown
3 Yes   so Z it was asleep
How do I remove the single-character tokens in the "sentence" column, such as "t", "p" and "Z", and then do a final clean using the stop_words list in tidytext to get the output below?
  Class sentence
1 Yes   beaker table
2 Yes   frown
3 Yes   asleep
To do this with tidytext, create a row-index column with row_number(), apply unnest_tokens on the sentence column, do an anti_join with the default data from get_stopwords(), filter out the words that are only one character long, and then group by the row index and paste the 'word' column back together to recreate 'sentence'.
library(dplyr)
library(tidytext)
library(stringr)

df %>%
  mutate(rn = row_number()) %>%                 # remember which row each word came from
  unnest_tokens(word, sentence) %>%             # one lowercase word per row
  anti_join(get_stopwords(), by = "word") %>%   # drop stop words
  filter(nchar(word) > 1) %>%                   # drop single-character words
  group_by(rn, Class) %>%
  summarise(sentence = str_c(word, collapse = ' '), .groups = 'drop') %>%
  select(-rn)
-Output
# A tibble: 3 x 2
  Class sentence
  <chr> <chr>
1 Yes   beaker table
2 Yes   frown
3 Yes   asleep
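The question asks for the stop_words list specifically, while the pipeline above uses get_stopwords(). As a minimal sketch, the same steps can be run against tidytext's bundled stop_words data frame, joined on its word column. stop_words combines several lexicons, so it may remove more words than get_stopwords() does; the result is worth checking against the expected output. The name cleaned is only illustrative.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Same example data as in the post, rebuilt here so the block runs on its own
df <- data.frame(
  Class = c("Yes", "Yes", "Yes"),
  sentence = c("there is p beaker on the table",
               "they t the frown",
               "so Z it was asleep"),
  stringsAsFactors = FALSE
)

cleaned <- df %>%
  mutate(rn = row_number()) %>%             # row index to regroup on later
  unnest_tokens(word, sentence) %>%         # one lowercase token per row
  anti_join(stop_words, by = "word") %>%    # stop_words instead of get_stopwords()
  filter(nchar(word) > 1) %>%               # drop single-character tokens
  group_by(rn, Class) %>%
  summarise(sentence = str_c(word, collapse = " "), .groups = "drop") %>%
  select(-rn)

cleaned
```

With either stop-word source, a row whose words are all removed drops out of the result, because no tokens are left to form a group for it.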
-Data
df <- structure(list(Class = c("Yes", "Yes", "Yes"),
  sentence = c("there is p beaker on the table",
    "they t the frown", "so Z it was asleep")),
  class = "data.frame", row.names = c("1", "2", "3"))