r text-mining

Removing gibberish from sentences


During text cleaning, is it possible to detect and remove junk like this from sentences:

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")

Currently I'm doing something like this:

library(stringr)
str_detect(x, pattern = 'Thisisaverylongexample')

But the more I review my data frame, the more sentences I find with this type of junk. How can I use something like a regex to detect and remove rows containing junk like this?


Solution

  • If the 'junk' is detectable by its unusual length, you can define a rule accordingly. For example, to target words of 10 or more characters, this extracts them:

    library(stringr)
    str_extract_all(x, "\\b\\w{10,}\\b")
    [[1]]
    [1] "Thisisaverylongexample" "removeitnow"           
    
    [[2]]
    [1] "thisisjustjunk"
    

    and this removes them from the strings:

    trimws(gsub("\\b\\w{10,}\\b", "", x))
    [1] "and I was to"         "but I do I remove it"
    

    Data:

    x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")
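
Since the question also asks about removing whole rows, the same length rule can be used as a row filter. The following is a minimal sketch, not a definitive approach: the data frame `df`, its column name `text`, and the third sentence are hypothetical and used only for illustration.

```r
# Hypothetical data frame; the column name `text` is an assumption
df <- data.frame(
  id = 1:3,
  text = c("Thisisaverylongexample and I was to removeitnow",
           "thisisjustjunk but I do I remove it",
           "a short clean sentence"),
  stringsAsFactors = FALSE
)

# Same length rule as above: any run of 10 or more word characters
junk_pattern <- "\\b\\w{10,}\\b"

# TRUE for rows that contain at least one over-long word
has_junk <- grepl(junk_pattern, df$text)

# Keep only the rows without such a word
clean_df <- df[!has_junk, ]
```

With this data only the third row should remain. Note that a pure length rule will also flag legitimate long words, so the threshold of 10 may need adjusting for real text.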