During text cleaning, is it possible to detect and remove junk like this from sentences:
x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")
currently I'm doing something like this:
str_detect(x, pattern = 'Thisisaverylongexample')
but the more I review my dataframe, the more sentences I find with this type of junk. How can I use something like a regex to detect and remove rows containing junk like this?
If the 'junk' is detectable by its unusual length, you can define a rule accordingly. For example, if you want to get rid of words of 10 or more characters, this extracts them:
library(stringr)
str_extract_all(x, "\\b\\w{10,}\\b")
[[1]]
[1] "Thisisaverylongexample" "removeitnow"
[[2]]
[1] "thisisjustjunk"
and this would get rid of them:
trimws(gsub("\\b\\w{10,}\\b", "", x))
[1] "and I was to" "but I do I remove it"
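If you prefer to stay within stringr, the same two tasks can be sketched like this (the 10-character cutoff is just an assumption; tune it to your data). str_squish also collapses any doubled internal spaces left behind by the removal, which trimws does not:

library(stringr)

x <- c("Thisisaverylongexample and I was to removeitnow",
       "thisisjustjunk but I do I remove it")

# remove the long "words", then tidy up the leftover whitespace
str_squish(str_remove_all(x, "\\b\\w{10,}\\b"))
# [1] "and I was to"         "but I do I remove it"

# or, to drop whole rows/elements that contain any long word:
x[!str_detect(x, "\\b\\w{10,}\\b")]
# character(0) -- both example sentences contain junk, so nothing survives
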
Data:
x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")