My data has an unexpected factor level that combines two levels with &: "intermediate 7 & 8".
What would be the best way to re-level this value? In the future, other levels might be combined the same way, such as "Beginner 3 & 4".
#Relevel factors
Sample <- as.factor(c("Beginner 1","intermediate 8", "intermediate 7 & 8",
"Expert 2","Expert 10","Beginner 3 & 4","Beginner 5",
"Beginner 10", "intermediate 1", "Expert 1", NA))
newLevel <- factor(c("NA", paste0("Beginner ", 1:10), paste0("intermediate ", 1:10),
paste0("Expert ", 1:10)))
newSample <- factor(Sample, levels=newLevel)
newSample
# [1] Beginner 1 intermediate 8 <NA> Expert 2 Expert 10
# [6] Beginner 3 Beginner 5 Beginner 10 intermediate 1 Expert 1
# [11] <NA>
# 31 Levels: NA Beginner 1 Beginner 2 Beginner 3 Beginner 4 Beginner 5 ... Expert 10
#Change factor to Numeric
SampleNum <- as.numeric(factor(Sample, levels=newLevel))
SampleNum
# [1] 2 19 NA 23 31 4 6 11 12 22 NA
So "intermediate 7 & 8" is treated as NA, but it should sort between "intermediate 7" and "intermediate 8".
Is there a good way to turn it into a proper factor level and, ideally, convert it to numeric?
You could strip out the numbers and, where a level contains two of them, take their mean to get quasi-numerical suffixes (suffix).
suffix <- sapply(strsplit(trimws(gsub("\\D+", " ", levels(Sample))), " "), function(x)
mean(as.numeric(x)))
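To see what this step produces, here is a minimal sketch on three made-up labels (labs, digits and suffix_demo are hypothetical names used only for illustration). It applies the same gsub/strsplit/mean chain, so a combined label such as "intermediate 7 & 8" ends up halfway between 7 and 8.

```r
# Hypothetical labels, only to illustrate the suffix step
labs <- c("Beginner 1", "intermediate 7 & 8", "Expert 10")

# Replace every run of non-digit characters with one space, then trim the ends
digits <- trimws(gsub("\\D+", " ", labs))
digits
# "1"   "7 8" "10"

# Split on the space and average the numbers found in each label
suffix_demo <- sapply(strsplit(digits, " "), function(x) mean(as.numeric(x)))
suffix_demo
# 1.0  7.5 10.0
```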
Then, to get the prefixes (prefix), convert the categories into larger numbers in the right order, using cat.df as a lookup table.
cat.df <- data.frame(c("Beginner", "intermediate", "Expert"),
(1:3)*100)
prefix <- sapply(gsub("(\\D+)\\s.*", "\\1", levels(Sample)), function(x, y)
cat.df[match(x, y), 2], cat.df[, 1])
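As an aside, the same lookup could probably be written with a named vector instead of a data frame. This is only a sketch under that assumption; rank_by_category, labs, category and prefix_demo are hypothetical names, and it relies on every label starting with the category name followed by whitespace.

```r
# Hypothetical labels, only to illustrate the prefix step
labs <- c("Beginner 1", "intermediate 7 & 8", "Expert 10")

# Category weights spaced far enough apart that the suffix (at most 10)
# can never push a label into the next category
rank_by_category <- c(Beginner = 100, intermediate = 200, Expert = 300)

# Keep everything before the first whitespace character
category <- sub("\\s.*", "", labs)

# Look each category up by name and drop the names from the result
prefix_demo <- unname(rank_by_category[category])
prefix_demo
# 100 200 300
```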
That's all you need to relevel the Sample vector.
new.Sample <- factor(Sample, levels=levels(Sample)[order(prefix + suffix)])
new.Sample
# [1] Beginner 1 intermediate 8 intermediate 7 & 8 Expert 2
# [5] Expert 10 Beginner 3 & 4 Beginner 5 Beginner 10
# [9] intermediate 1 Expert 1 <NA>
# 10 Levels: Beginner 1 Beginner 3 & 4 Beginner 5 Beginner 10 ... Expert 10
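The releveling itself comes down to reordering the existing levels by a numeric key. A small sketch of that idea on a toy factor (f, lv, key and f2 are hypothetical names, and the weights 100 and 200 are arbitrary):

```r
# Toy factor whose default (alphabetical) level order puts "a 10" before "a 2"
f  <- factor(c("b 2", "a 10", "a 2"))
lv <- levels(f)

# Sort key: category weight plus the trailing number
key <- c(a = 100, b = 200)[sub("\\s.*", "", lv)] + as.numeric(sub(".*\\s", "", lv))

# Rebuild the factor with the levels sorted by the key
f2 <- factor(f, levels = lv[order(key)])
levels(f2)
# "a 2"  "a 10" "b 2"
as.numeric(f2)
# 3 2 1
```

The values in f are untouched; only the order of the levels, and therefore the integer codes, changes.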
data.frame(sort(new.Sample), as.numeric(sort(new.Sample)))
# sort.new.Sample. as.numeric.sort.new.Sample..
# 1 Beginner 1 1
# 2 Beginner 3 & 4 2
# 3 Beginner 5 3
# 4 Beginner 10 4
# 5 intermediate 1 5
# 6 intermediate 7 & 8 6
# 7 intermediate 8 7
# 8 Expert 1 8
# 9 Expert 2 9
# 10 Expert 10 10
as.numeric(new.Sample)
# [1] 1 7 6 9 10 2 3 4 5 8 NA
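Note that as.numeric() on the releveled factor returns ranks among the levels that actually occur, not positions on the original 30-step scale. If positions on that scale are wanted, one possible variation (an assumption on my part, not part of the approach above) is to add the averaged number to a per-category offset. The names x, offset, within_cat and score are hypothetical, and the sketch assumes exactly 10 steps per category.

```r
# A few of the labels from the question, as plain character strings
x <- c("Beginner 1", "intermediate 8", "intermediate 7 & 8", "Beginner 3 & 4", NA)

# Assumed offsets: 10 steps per category
offset <- c(Beginner = 0, intermediate = 10, Expert = 20)

# Category name = everything before the first whitespace
category <- sub("\\s.*", "", x)

# Mean of the numbers inside each label (NA stays NA)
within_cat <- sapply(strsplit(trimws(gsub("\\D+", " ", x)), " "),
                     function(d) mean(as.numeric(d)))

score <- unname(offset[category]) + within_cat
score
# 1.0 18.0 17.5  3.5   NA
```

Here "intermediate 7 & 8" lands at 17.5, between "intermediate 7" (17) and "intermediate 8" (18).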
Data
Sample <- structure(c(1L, 10L, 9L, 7L, 6L, 3L, 4L, 2L, 8L, 5L, NA), .Label = c("Beginner 1",
"Beginner 10", "Beginner 3 & 4", "Beginner 5", "Expert 1", "Expert 10",
"Expert 2", "intermediate 1", "intermediate 7 & 8", "intermediate 8"
), class = "factor")