Cleaning up Likert scale data: how to test whether values are consecutive when unrelated strings are mixed in?

I need to clean up data that was collected with a Likert scale. That is, each observation in my data comes from a person who chose one option from an ordinal scale, such as "on a scale of 1-5, where 1 means awful and 5 means wonderful, how would you rate your liking of eggplants?"

Thus, a typical dataset will look like

library(tibble)

set.seed(123)
df_a <- 
  tibble(name = c("clara", "john", "michelle", "dan", 'timothy', "cindy", "george", "monica", "david", "rebecca"),
       response = sample(1:5, 10, replace = TRUE))

   name     response
   <chr>       <int>
 1 clara           3
 2 john            3
 3 michelle        2
 4 dan             2
 5 timothy         3
 6 cindy           5
 7 george          4
 8 monica          1
 9 david           2
10 rebecca         3

My task is to test whether the data is indeed from a Likert scale, meaning that (1) the values are integers, and (2) the unique values, once sorted, are consecutive.

Testing whether all are integers can be done by

all((df_a$response - round(df_a$response)) == 0) ## https://stackoverflow.com/a/10114038/6105259

[1] TRUE

Testing whether the unique values are consecutive is something I don't know how to do, but my problem doesn't end here.
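For clean numeric data, one possible way to check consecutiveness is to sort the unique values and confirm that every step between neighbours is exactly 1. This is only a minimal sketch; the helper name `is_consecutive` is hypothetical and not part of the original post.

```r
# Hypothetical helper: TRUE when the sorted unique values increase in steps of 1.
# Note: a vector with a single unique value also returns TRUE, because
# diff() yields an empty vector and all() of an empty vector is TRUE.
is_consecutive <- function(x) {
  vals <- sort(unique(x))
  all(diff(vals) == 1)
}

is_consecutive(c(3, 3, 2, 2, 3, 5, 4, 1, 2, 3))  # unique values are 1 2 3 4 5
is_consecutive(c(1, 3, 4, 5, 3, 1))              # unique values are 1 3 4 5 (gap at 2)
```

The first call covers the `df_a$response` values shown above; the second has a gap and so fails the check.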

My real problem is that a Likert scale can come in different variations, and that other strings might show up in the data, adding noise.

A valid Likert scale could span different ranges, for example 1-5, 0-3, or 1-10.
Many times there will be additional strings such as "irrelevant", "I don't know", "I don't think so", "not applicable to me", and so on. I cannot anticipate which of these additional strings will be present in the data, if any at all.

Under such circumstances, I need to detect whether my data most likely comes from a Likert scale.

Criteria for deciding that the data is Likert scale:

numeric values are integers.
when we take the unique values, they are consecutive (in the sense that sort(unique(df_a$response)) returns 1 2 3 4 5; if it had returned 1 3 4 5, it would have failed the "consecutiveness" criterion)
the smallest value in the range is either 0 or 1
the greatest value is at most 10
noise strings that aren't numeric (such as "I don't know", "abcd34", "irrelevant") account for less than 50% of the data

Below are 4 examples that demonstrate possible types of data and what I expect to happen when testing whether they're "likert" or not.
In the examples I use stringi::stri_rand_strings to simulate the "noise" strings (e.g., "I don't know", "irrelevant", and the other examples I gave above).

Example 1 -- testing for "is likert scale" should return `TRUE`

library(stringi)

set.seed(19)
val_begin <- sample(0:1, 1)
val_end <- sample(3:10, 1)
my_seq <- seq(from = val_begin, to = val_end)
additional_strings <- stri_rand_strings(n = 2, length = 5, pattern = "[A-Za-z0-9]")

vec_example_1 <- sample(c(my_seq, additional_strings), size = 100 , replace = TRUE)

barplot(prop.table(table(vec_example_1)), main = "vec example 1")

Example 2 -- testing for "is likert scale" should return `FALSE`

In the following data, numbers are not consecutive

set.seed(19)
my_seq_2 <- sort(c(seq(0,4), seq(7, 9)))
additional_strings_2 <- stri_rand_strings(n = 2, length = 5, pattern = "[A-Za-z0-9]")
vec_example_2 <- sample(c(my_seq_2, additional_strings_2), size = 100 , replace = TRUE)

barplot(prop.table(table(vec_example_2)), main = "vec example 2")

Example 3 -- testing for "is likert scale" should return `FALSE`

In the following data, the "additional strings" account for more than 50% of data, making it unlikely that the core of data is likert scale

set.seed(19)
vec_example_3 <- sample(c(rep(additional_strings, 70), sample(my_seq, 30, replace = T))) 
barplot(prop.table(table(vec_example_3)), main = "vec example 3")

Example 4 -- testing for "is likert scale" should return `FALSE`

Just random numbers and strings, with no reason to believe this is a Likert scale. Even though the unique values happen to be consecutive, a range of 1 to 30 is simply unlikely to be Likert.

set.seed(19)
vec_example_4 <- sample(c(1:30, additional_strings), 1000, replace = T) 
barplot(prop.table(table(vec_example_4)), main = "vec example 4")

What I'm asking

I assume that a full solution would be pretty lengthy, so maybe it's too much to ask from people here. So I will be happy for even just tips, a general approach, or ideas how to tackle this.

Solution

You can write a function to identify if the vector follows the rules that we are looking for.

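is_likert <- function(x) {
  # Unique entries made up only of digits (non-negative integers), sorted; the rest is noise
  only_numbers <- sort(as.numeric(unique(grep('^\\d+$', x, value = TRUE))))
  # Always TRUE given the regex above; kept to make the criterion explicit
  all_integers <- all(only_numbers %% 1 == 0)
  # Every step between neighbouring unique values must be exactly 1
  are_consecutive <- all(diff(only_numbers) == 1)
  # Share of entries that are numbers (counted over all entries, not unique ones)
  ratio_of_numbers <- mean(grepl('^\\d+$', x))
  # Note: max()/min() warn and return -Inf/Inf when there are no numbers at all
  max_num <- max(only_numbers)
  min_num <- min(only_numbers)

  all_integers && are_consecutive && ratio_of_numbers > 0.5 && 
  max_num <= 10 && min_num <= 1
}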

is_likert(vec_example_1)
#[1] TRUE
is_likert(vec_example_2)
#[1] FALSE
is_likert(vec_example_3)
#[1] FALSE
is_likert(vec_example_4)
#[1] FALSE

I hope the variable names are clear enough to demonstrate what they are doing.
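If the input might contain no numbers at all, or missing values, a guarded variant can return FALSE early rather than calling max() and min() on an empty vector. The following is only a sketch under stated assumptions: the name `is_likert_guarded` and the parameters `max_allowed` and `min_numeric_share` are hypothetical, and it reads the "smallest value is 0 or 1" criterion literally.

```r
# Hypothetical variant of is_likert() with explicit guards; the name and the
# default arguments are illustrative assumptions, not part of the answer above.
is_likert_guarded <- function(x, max_allowed = 10, min_numeric_share = 0.5) {
  x <- as.character(x)
  x <- x[!is.na(x)]                       # drop missing values before counting
  if (length(x) == 0) return(FALSE)

  is_int_string <- grepl("^\\d+$", x)     # entries made up only of digits
  if (!any(is_int_string)) return(FALSE)  # nothing numeric: skip max()/min()

  vals <- sort(unique(as.numeric(x[is_int_string])))

  all(diff(vals) == 1) &&                       # unique values are consecutive
    mean(is_int_string) > min_numeric_share &&  # noise strings are a minority
    min(vals) %in% c(0, 1) &&                   # scale starts at 0 or 1
    max(vals) <= max_allowed                    # scale ends at 10 or below
}

is_likert_guarded(c(1:5, "n/a"))     # consecutive 1-5 with a single noise string
is_likert_guarded(c("abc", "def"))   # only noise strings
is_likert_guarded(c(1, 3, 4, 5))     # gap at 2
```

Whether dropping NA values before computing the noise share is appropriate depends on how missing responses should be treated in the data at hand.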
