I need to clean up data that was collected with a Likert scale. That is, each observation comes from a person who chose one option on an ordinal scale, such as "on a scale of 1-5, where 1 means awful and 5 means wonderful, how would you rate your liking of eggplants?"
Thus, a typical dataset will look like this:
library(tibble)
set.seed(123)
df_a <-
  tibble(name = c("clara", "john", "michelle", "dan", "timothy", "cindy", "george", "monica", "david", "rebecca"),
         response = sample(1:5, 10, replace = TRUE))
name response
<chr> <int>
1 clara 3
2 john 3
3 michelle 2
4 dan 2
5 timothy 3
6 cindy 5
7 george 4
8 monica 1
9 david 2
10 rebecca 3
My task is to test whether the data are indeed from a Likert scale, meaning that (1) the values are integers, and (2) the unique values, once sorted, are consecutive.
all((df_a$response - round(df_a$response)) == 0) ## https://stackoverflow.com/a/10114038/6105259
[1] TRUE
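The second criterion can be checked along similar lines. A minimal sketch, assuming a purely numeric vector; `is_consecutive` is a helper name made up for illustration:

```r
# Hypothetical helper: TRUE when the sorted unique values
# increase in steps of exactly 1.
is_consecutive <- function(x) {
  all(diff(sort(unique(x))) == 1)
}

is_consecutive(c(3, 3, 2, 2, 3, 5, 4, 1, 2, 3))  # unique values 1 to 5, no gaps: TRUE
is_consecutive(c(1, 3, 4, 5, 3, 1))              # 2 is missing: FALSE
```

This does not yet cope with non-numeric entries mixed into the data.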
A valid Likert scale could span different ranges, for example 1-5, 0-3, or 1-10.
Often there will be additional strings such as "irrelevant", "I don't know", "I don't think so", "not applicable to me", and so on. I cannot anticipate which such additional strings will be present in the data, if any at all.
Under such circumstances, I need to detect whether my data are likely to come from a Likert scale.
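Criteria to decide that data is Likert scale: the numbers are integers; the unique values are consecutive, i.e. sort(unique(df_a$response)) returns 1 2 3 4 5 (if it had returned 1 3 4 5, it would have failed the "consecutiveness" criterion); the minimum value is 0 or 1; and the maximum value is no greater than 10.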
Below are 4 examples to demonstrate possible types of data and what I expect should happen when testing whether they are "Likert" or not. In the examples I use stringi::stri_rand_strings to simulate the "noise" strings (e.g., "I don't know", "irrelevant", and the other examples I gave above).
Example 1: should return TRUE
library(stringi)
set.seed(19)
val_begin <- sample(0:1, 1)
val_end <- sample(3:10, 1)
my_seq <- seq(from = val_begin, to = val_end)
additional_strings <- stri_rand_strings(n = 2, length = 5, pattern = "[A-Za-z0-9]")
vec_example_1 <- sample(c(my_seq, additional_strings), size = 100 , replace = TRUE)
barplot(prop.table(table(vec_example_1)), main = "vec example 1")
Example 2: should return FALSE
In the following data, the numbers are not consecutive.
set.seed(19)
my_seq_2 <- sort(c(seq(0,4), seq(7, 9)))
additional_strings_2 <- stri_rand_strings(n = 2, length = 5, pattern = "[A-Za-z0-9]")
vec_example_2 <- sample(c(my_seq_2, additional_strings_2), size = 100 , replace = TRUE)
barplot(prop.table(table(vec_example_2)), main = "vec example 2")
Example 3: should return FALSE
In the following data, the "additional strings" account for more than 50% of the data, making it unlikely that the core of the data is Likert scale.
set.seed(19)
vec_example_3 <- sample(c(rep(additional_strings, 70), sample(my_seq, 30, replace = TRUE)))
barplot(prop.table(table(vec_example_3)), main = "vec example 3")
Example 4: should return FALSE
Just random numbers and strings. There is no reason to believe this is a Likert scale: even though the unique values happen to be consecutive, a range of 1 to 30 is simply unlikely to be Likert.
set.seed(19)
vec_example_4 <- sample(c(1:30, additional_strings), 1000, replace = TRUE)
barplot(prop.table(table(vec_example_4)), main = "vec example 4")
I assume that a full solution would be pretty lengthy, so maybe it's too much to ask from people here. I would be happy with even just tips, a general approach, or ideas on how to tackle this.
You can write a function to identify if the vector follows the rules that we are looking for.
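is_likert <- function(x) {
  # Unique entries made up of digits only, converted to numbers and sorted
  only_numbers <- sort(as.numeric(unique(grep('^\\d+$', x, value = TRUE))))
  # Guaranteed by the digits-only regex above; kept as an explicit check
  all_integers <- all(only_numbers %% 1 == 0)
  # Sorted unique values must increase in steps of 1
  are_consecutive <- all(diff(only_numbers) == 1)
  # Share of entries that are numbers rather than "noise" strings
  ratio_of_numbers <- mean(grepl('^\\d+$', x))
  max_num <- max(only_numbers)
  min_num <- min(only_numbers)
  # The scale must start at 0 or 1 and end at 10 or below
  all_integers && are_consecutive && ratio_of_numbers > 0.5 &&
    max_num <= 10 && min_num <= 1
}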
is_likert(vec_example_1)
#[1] TRUE
is_likert(vec_example_2)
#[1] FALSE
is_likert(vec_example_3)
#[1] FALSE
is_likert(vec_example_4)
#[1] FALSE
I hope the variable names are clear enough to demonstrate what they are doing.
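To see what the intermediate variables hold, here is a small walk-through on a toy vector. The input is made up for illustration and is not one of the examples from the question:

```r
# Toy input (hypothetical): a 1-5 scale plus two noise strings
x <- c("1", "2", "3", "3", "4", "5", "5", "n/a", "irrelevant", "2")

# Which entries consist of digits only?
is_number <- grepl("^\\d+$", x)

# Unique numeric values, sorted
only_numbers <- sort(as.numeric(unique(x[is_number])))

only_numbers        # 1 2 3 4 5
diff(only_numbers)  # 1 1 1 1, so the values are consecutive
mean(is_number)     # 0.8, so numbers make up more than half of the entries
```

One thing to be aware of: if a vector contains no digit-only entries at all, max() and min() are called on an empty vector and emit warnings, although the function still returns FALSE because the ratio of numbers is 0.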