Removing gibberish from sentences

During text cleaning, is it possible to detect and remove junk like this from sentences:

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")

Currently I'm doing something like this:

str_detect(x, pattern = 'Thisisaverylongexample')

But the more I review my data frame, the more sentences I find with this type of junk. How can I use something like a regex to detect and remove rows that contain junk like this?

Solution

If 'junk' is detectable via its unusual length, you can define a rule accordingly. For example, if you want to get rid of words of 10 or more characters, this would extract them:

library(stringr)
str_extract_all(x, "\\b\\w{10,}\\b")
[[1]]
[1] "Thisisaverylongexample" "removeitnow"           

[[2]]
[1] "thisisjustjunk"

and this would get rid of them:

trimws(gsub("\\b\\w{10,}\\b", "", x))
[1] "and I was to"         "but I do I remove it"
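
Since the question also asks about removing rows, here is a minimal sketch of applying the same length rule to a data frame. The data frame `df`, its column name `text`, and the extra example rows are assumptions made for illustration. It uses base R's grepl() and gsub() with perl = TRUE instead of stringr, so it has no package dependency. Note that a threshold of 10 characters also catches legitimate long words (e.g. "information"), so you may need to raise it for your data.

```r
# Hypothetical data frame; the column name `text` is an assumption
df <- data.frame(
  text = c("Thisisaverylongexample and I was to removeitnow",
           "thisisjustjunk but I do I remove it",
           "a clean short sentence",
           "please drop Thisisaverylongexample here"),
  stringsAsFactors = FALSE
)

# Same rule as above: any run of 10 or more word characters counts as junk
junk <- "\\b\\w{10,}\\b"

# Option 1: drop every row that contains at least one junk word
has_junk <- grepl(junk, df$text, perl = TRUE)
df_rows_removed <- df[!has_junk, , drop = FALSE]

# Option 2: keep all rows but strip the junk words, then collapse the
# doubled spaces left behind when a word is removed mid-sentence
stripped <- gsub(junk, "", df$text, perl = TRUE)
df$text_clean <- trimws(gsub("\\s+", " ", stripped))
```

Option 1 keeps only the third row; Option 2 keeps every row and stores the cleaned sentence in a new column, so the original text stays available for checking.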

Data:

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")