I have a dataset of users' answers about whether they know a brand or not. Some of the users just answered nonsense, as you can see in my example.
library(stringr)

meinstring <- c("----asdada", "no idea", "C&A", "aaaaaaaaaa", "---", "adaosdjasodajsdoad")

spamidenfifier <- function(x) {
  chars <- strsplit(as.character(x), "")[[1]]
  ## share of vowels among all characters
  verhaeltnis <- str_count(tolower(x), "[aeoiu]") / str_count(x)
  ## number of characters that sit at position 3 or later in a run of identical characters
  sequenz <- sum(sequence(rle(chars)$lengths) >= 3, na.rm = TRUE)
  ## weight, because high character variety is less likely in a longer string
  if (str_count(x) > 4) { weight <- 0.9 } else { weight <- 1 }
  ## share of distinct characters, weighted
  variation_buchstaben <- length(unique(chars)) / str_count(x) * weight
  if (verhaeltnis < 0.2 | verhaeltnis > 0.8 | sequenz > 0 | variation_buchstaben < 0.5) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

sapply(meinstring, spamidenfifier)
Output:
        ----asdada            no idea                C&A         aaaaaaaaaa                --- adaosdjasodajsdoad
              TRUE              FALSE              FALSE               TRUE               TRUE               TRUE
My function doesn't work too badly, but there might be better solutions. Is there a package or a better method to identify whether a word was merely misspelled or the person answered nonsense? If not, suggestions for improving the function are highly appreciated!
Edit: I added some improvements :-)
Just my spontaneous idea:
meinstring <- c("----asdada", "no idea", "C&A", "aaaaaaaaaa", "---", "adaosdjasodajsdoad", "+-*-", "*-+-", "adfpdflrraaeea")
grepl('^\\W+$|(?:[-!@#$%^&*\\[\\]()";:_<>.,=+/ ]){2,}|[-!@#$%^&*\\[\\]()";:_<>.,=+/ ]{3,}|[aeoiu]{3,}',
      meinstring, perl = TRUE) & !grepl("iou|zweieiig", meinstring)  # add exceptions to the second grepl
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
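To make the one-liner easier to extend, the same checks can be split into named patterns. This is only a sketch: `is_nonsense`, `punct`, and the `exceptions` argument are names made up here for illustration, and the patterns are taken from the call above (the separate `{3,}` symbol alternative is left out because the `{2,}` one already covers it).

```r
# Punctuation/space character class, copied from the regex above.
punct <- '[-!@#$%^&*\\[\\]()";:_<>.,=+/ ]'

# Hypothetical helper: TRUE where an answer looks like nonsense.
is_nonsense <- function(x, exceptions = c("iou", "zweieiig")) {
  patterns <- c(
    only_symbols = "^\\W+$",               # nothing but non-word characters
    symbol_run   = paste0(punct, "{2,}"),  # two or more symbols in a row
    vowel_run    = "[aeoiu]{3,}"           # three or more lowercase vowels in a row
  )
  suspicious <- grepl(paste(patterns, collapse = "|"), x, perl = TRUE)
  # An empty exception list must not turn into the pattern "", which matches everything.
  excepted <- if (length(exceptions) > 0) {
    grepl(paste(exceptions, collapse = "|"), x)
  } else {
    FALSE
  }
  suspicious & !excepted
}

meinstring <- c("----asdada", "no idea", "C&A", "aaaaaaaaaa", "---",
                "adaosdjasodajsdoad", "+-*-", "*-+-", "adfpdflrraaeea")
res <- is_nonsense(meinstring)
print(res)
```

New rules can then be added as further entries in `patterns`, and legitimate words that trip a rule go into `exceptions` (they are treated as regular expressions, as in the original call).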
There is no neat, perfect solution; rules like these will always need exceptions.
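If a list of the expected brand names is available, one possible way to tell a misspelling from nonsense is to compare each answer with that list by edit distance. The sketch below uses `adist()` from the `utils` package that ships with R; the `brands` vector, the function name `closest_brand`, and the cut-off `max_rel_dist = 0.34` are assumptions for illustration, not tuned values.

```r
brands <- c("C&A", "H&M", "Zara", "Adidas")  # hypothetical reference list

# Hypothetical helper: returns the closest brand, or NA if nothing is close enough.
closest_brand <- function(x, brands, max_rel_dist = 0.34) {
  # Levenshtein distance between every answer and every brand, ignoring case
  d <- utils::adist(tolower(x), tolower(brands))
  best <- apply(d, 1, which.min)
  # distance relative to the longer of the two strings
  rel <- d[cbind(seq_along(x), best)] / pmax(nchar(x), nchar(brands[best]))
  ifelse(rel <= max_rel_dist, brands[best], NA_character_)
}

answers <- c("C&A", "Addidas", "zarra", "adaosdjasodajsdoad", "---")
matched <- closest_brand(answers, brands)
print(matched)
```

Answers that come back as `NA` are not close to any known brand and could then be passed to a nonsense check. The sketch assumes that `x` contains no missing values.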