Remove words shorter than a given character length, plus noise reduction, before tokenization

I have the following data frame:

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

Based on an answer to a previous question, stop words can be removed with the following code:

report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5

I want to remove words shorter than a certain character length (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove manual stop words (for example, saw and kityy) and common noise (extra whitespace, numbers, and punctuation) before tokenization. The final outcome would be:

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                                   wood  4
5                             hello best  5

A similar question regarding noise and manual stop words is posted here.


  • Building on the previous code, it should work if we first remove the words whose nchar is less than or equal to 3 (with gsubfn), then strip punctuation, digits, and stop words:

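    library(gsubfn)
    # drop words of <= 3 chars, then punctuation/digits, then stop words;
    # finally collapse repeated whitespace and trim the ends
    trimws(gsub("\\s+", " ", gsub(paste0("\\b(", paste(union(c("saw", "kityy"), 
       tm::stopwords("english")), collapse="|"), ")\\b"), "", 
         gsub("[[:punct:]0-9]+", "",gsubfn("\\w+", function(x) 
         if(nchar(x) > 3) x else '', report$Text)))))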


    [1] "unit crosses street"    "driver speeding driver" 
    [3] "year year pandemic"     "wood"                   "hello best"
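
For readers who would rather not depend on gsubfn, the same steps can be sketched one at a time in base R. This is an illustrative rewrite, not the answer's original code: `clean_text`, `min_chars`, and `stop_words` are hypothetical names, and the short stop-word vector in the usage lines is only a stand-in for `union(c("saw", "kityy"), tm::stopwords("english"))` so that the example runs without tm installed.

```r
# Stepwise sketch of the cleaning pipeline in base R.
# clean_text(), min_chars and stop_words are hypothetical names.
clean_text <- function(x, min_chars = 4, stop_words = character(0)) {
  # 1. Drop whole words with fewer than min_chars characters
  if (min_chars > 1) {
    short_pat <- sprintf("\\b\\w{1,%d}\\b", min_chars - 1)
    x <- gsub(short_pat, "", x, perl = TRUE)
  }
  # 2. Drop punctuation and digits
  x <- gsub("[[:punct:]0-9]+", "", x)
  # 3. Drop stop words, matched as whole words
  if (length(stop_words) > 0) {
    stop_pat <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
    x <- gsub(stop_pat, "", x, perl = TRUE)
  }
  # 4. Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x))
}

text <- c("unit 1 crosses the street",
          "driver 2 was speeding and saw driver# 1",
          "year 2019 was the year before the pandemic",
          "hey saw       hei hei in        the    wood",
          "hello: my kityy! you are the best")

# Stand-in stop list (assumption); with tm installed, pass
# union(c("saw", "kityy"), tm::stopwords("english")) instead.
cleaned <- clean_text(text, stop_words = c("saw", "kityy", "before"))
print(cleaned)
```

Keeping each step on its own line makes it easy to reorder or drop a step, and to change the length threshold through `min_chars` rather than editing the regular expression by hand.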