r nlp data-cleaning tf-idf word-frequency

Remove words that occur only once and with low IDF in R


I have a data frame with a text column, and I want to apply three pre-processing steps:

1) remove words that occur only once
2) remove words with low inverse document frequency (IDF)
3) remove the words that occur most frequently

This is an example of the data:

head(stormfront_data$stormfront_self_content)

Output:

[1] "        , ,    stormfront!  thread       members  post  introduction,     \".\"     stumbled   white networking site,    reading & decided  register  account,      largest networking site     white brothers,  sisters!    read : : guidelines  posting - stormfront introduction  stormfront - stormfront  main board consists   forums,  -forums   : newslinks & articles - stormfront ideology  philosophy - stormfront activism - stormfront       network   local level: local  regional - stormfront international - stormfront  ,  .  addition   main board   supply  social groups    utilized  networking.  final note:      steps    sustaining member,  core member      site online,   affords  additional online features. sf: shopping cart   stormfront!"
[2] "bonjour      warm  brother !   forward  speaking     !"                                                                                                                      
[3] " check   time  time   forums.      frequently    moved  columbia   distinctly  numbered.    groups  gatherings         "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4] "  !  site  pretty nice.    amount  news articles.  main concern   moment  islamification."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
[5] " , discovered  site   weeks ago.  finally decided  join   found  article  wanted  share  .   proud   race   long time    idea  site    people  shared  views existed."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
[6] "  white brothers,  names jay      member   years,        bit  info    ?    stormfront meet ups     ? stay strong guys    jay, uk"                                                                                                           

Any help would be greatly appreciated, as I am not too familiar with R.


Solution

  • Here is an approach with tidytext:

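    library(tidytext)
    library(dplyr)
    
    # Tokenize once: one row per word occurrence in each document.
    # `data` is the 6 x 1 character matrix given under "Data" below.
    tokens <- tibble(document = seq_len(nrow(data)), text = as.character(data)) %>%
      unnest_tokens(word, text)
    
    # n: how often each word occurs within each document
    word_count <- tokens %>%
      count(document, word, sort = TRUE)
    
    # total: how often each word occurs across all documents
    total_count <- tokens %>%
      count(word, name = "total")
    
    # Join on the shared word column so every row carries both counts
    words <- left_join(word_count, total_count, by = "word")
    
    words %>%
      bind_tf_idf(word, document, n)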
    # A tibble: 111 x 7
       document word             n total     tf   idf tf_idf
          <int> <chr>        <int> <int>  <dbl> <dbl>  <dbl>
     1        1 stormfront      10    11 0.139  1.10  0.153 
     2        1 networking       3     3 0.0417 1.79  0.0747
     3        1 site             3     6 0.0417 0.693 0.0289
     4        1 board            2     2 0.0278 1.79  0.0498
     5        1 forums           2     3 0.0278 1.10  0.0305
     6        1 introduction     2     2 0.0278 1.79  0.0498
     7        1 local            2     2 0.0278 1.79  0.0498
     8        1 main             2     3 0.0278 1.10  0.0305
     9        1 member           2     3 0.0278 1.10  0.0305
    10        1 online           2     2 0.0278 1.79  0.0498
    # … with 101 more rows
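
    As a sanity check on the tf, idf and tf_idf columns, the first few rows can be reproduced by hand. This is a sketch in base R, assuming that idf is the natural log of (number of documents / number of documents containing the word) and that tf is n divided by the number of tokens in the document. The document counts per word and the 72 tokens in document 1 are inferred from the sample data and the printed values, not reported by bind_tf_idf itself.

    ```r
    # Six posts in the sample data
    n_docs <- 6

    # Number of posts containing each word at least once
    # (inferred by reading the sample data, so treat these as assumptions)
    docs_with_word <- c(stormfront = 2, networking = 1, site = 3)

    # IDF: natural log of (total documents / documents containing the word)
    idf <- log(n_docs / docs_with_word)
    print(round(idf, 3))

    # TF for "stormfront" in document 1: 10 occurrences out of 72 tokens
    # (72 is inferred from n / tf in the printed output)
    tf_stormfront <- 10 / 72

    # TF-IDF is the product of the two
    tf_idf_stormfront <- tf_stormfront * idf[["stormfront"]]
    print(round(tf_idf_stormfront, 3))
    ```

    The rounded values agree with the idf column above (1.10, 1.79, 0.693) and with the tf_idf of 0.153 for stormfront. A word that appears in every post gets an idf of 0, which is why a low idf cutoff removes words that are common across the whole corpus.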
    

    From here, filtering is straightforward with dplyr::filter (for example on the total and idf columns), but since you don't define any specific criteria other than "only once", I'll leave the thresholds to you.
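
    One possible way to apply the three steps is sketched below. It is a minimal example rather than a definitive recipe: the toy corpus posts, the threshold names min_total, min_idf and top_n_drop, and their values are all placeholders made up for illustration, so pick cutoffs that suit your own data.

    ```r
    library(tidytext)
    library(dplyr)

    # Hypothetical toy corpus standing in for the real text column
    posts <- c("the site has a forum and the forum has news",
               "the site is nice",
               "the news on the site is old")

    # One row per word occurrence, in reading order
    tokens <- tibble(document = seq_along(posts), text = posts) %>%
      unnest_tokens(word, text)

    # Per-document counts, corpus-wide totals, and tf / idf / tf_idf
    words <- tokens %>%
      count(document, word) %>%
      add_count(word, wt = n, name = "total") %>%
      bind_tf_idf(word, document, n)

    # Placeholder thresholds (assumptions, tune for your data)
    min_total  <- 2    # step 1: drop words seen fewer than 2 times overall
    min_idf    <- 0.1  # step 2: drop words found in (nearly) every document
    top_n_drop <- 1    # step 3: drop the N most frequent words overall

    # Step 3: the N words with the highest corpus-wide count
    most_frequent <- words %>%
      distinct(word, total) %>%
      slice_max(total, n = top_n_drop, with_ties = FALSE) %>%
      pull(word)

    # Every word that fails at least one of the three criteria
    drop_words <- words %>%
      filter(total < min_total | idf < min_idf | word %in% most_frequent) %>%
      distinct(word)

    # Remove those words and rebuild one cleaned string per document
    cleaned <- tokens %>%
      anti_join(drop_words, by = "word") %>%
      group_by(document) %>%
      summarize(text = paste(word, collapse = " "))

    print(cleaned)
    ```

    Filtering the token table with anti_join, rather than the counted table, keeps the original word order when the text is pasted back together. Note that a document whose words are all removed will not appear in cleaned at all, and that steps 2 and 3 often overlap, since the most frequent words tend to have the lowest idf.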

    Data

    data <- structure(c("        , ,    stormfront!  thread       members  post  introduction,     \".\"     stumbled   white networking site,    reading & decided  register  account,      largest networking site     white brothers,  sisters!    read : : guidelines  posting - stormfront introduction  stormfront - stormfront  main board consists   forums,  -forums   : newslinks & articles - stormfront ideology  philosophy - stormfront activism - stormfront       network   local level: local  regional - stormfront international - stormfront  ,  .  addition   main board   supply  social groups    utilized  networking.  final note:      steps    sustaining member,  core member      site online,   affords  additional online features. sf: shopping cart   stormfront!", 
    "bonjour      warm  brother !   forward  speaking     !", " check   time  time   forums.      frequently    moved  columbia   distinctly  numbered.    groups  gatherings         ", 
    "  !  site  pretty nice.    amount  news articles.  main concern   moment  islamification.", 
    " , discovered  site   weeks ago.  finally decided  join   found  article  wanted  share  .   proud   race   long time    idea  site    people  shared  views existed.", 
    "  white brothers,  names jay      member   years,        bit  info    ?    stormfront meet ups     ? stay strong guys    jay, uk"
    ), .Dim = c(6L, 1L))