How to relevel a factor that contains levels combined with "&"


My data has an unexpected factor level that combines two levels with &: "intermediate 7 & 8"

What would be the best way to relevel this value? In the future, other levels may be combined in the same way, such as "Beginner 3 & 4".

#Relevel factors
Sample <- as.factor(c("Beginner 1","intermediate 8", "intermediate 7 & 8", 
                     "Expert 2","Expert 10","Beginner 3 & 4","Beginner 5",
                     "Beginner 10", "intermediate 1", "Expert 1", NA))
newLevel <- factor(c("NA", paste0("Beginner ", 1:10), paste0("intermediate ", 1:10), 
                   paste0("Expert ", 1:10)))
newSample <- factor(Sample, levels=newLevel)

newSample
# [1] Beginner 1     intermediate 8 <NA>           Expert 2       Expert 10
# [6] <NA>           Beginner 5     Beginner 10    intermediate 1 Expert 1
# [11] <NA>
#   31 Levels: NA Beginner 1 Beginner 2 Beginner 3 Beginner 4 Beginner 5 ... Expert 10

# Change factor to numeric
SampleNum <- as.numeric(factor(Sample, levels=newLevel))
SampleNum
# [1]  2 19 NA 23 31 NA  6 11 12 22 NA

So "intermediate 7 & 8" becomes NA, because it is not among the levels in newLevel. It should instead sort between "intermediate 7" and "intermediate 8".
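
A quick check of why this happens, as a minimal sketch: factor() looks each value up in the levels by exact matching, so a combined label that is not in the set has no position. The level vector is rebuilt here as a plain character vector so the snippet runs on its own.

```r
# Levels as in the question: "NA" plus three tiers of ten steps each
newLevel <- c("NA", paste0("Beginner ", 1:10), paste0("intermediate ", 1:10),
              paste0("Expert ", 1:10))

# Exact lookup, which is what factor() relies on internally
match("intermediate 8", newLevel)       # 19
match("intermediate 7 & 8", newLevel)   # NA: the combined label is not a level

# The same thing seen through factor()
x <- factor(c("intermediate 8", "intermediate 7 & 8"), levels = newLevel)
is.na(x)
```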

Is there a good way to build the factor so that such values are kept, and so that it can still be converted to numeric?


Solution

  • You could extract the numbers from each level and, where a level contains two of them, take their mean to get a quasi-numerical suffix.

    suffix <- sapply(strsplit(trimws(gsub("\\D+", " ", levels(Sample))), " "), function(x) 
      mean(as.numeric(x)))
    

    Then, to get the prefixes, map each category to a large number that reflects the intended order, using cat.df as a lookup table. The step of 100 keeps the categories apart, since the suffixes here never exceed 10.

    cat.df <- data.frame(c("Beginner", "intermediate", "Expert"),
                          (1:3)*100)
    prefix <- sapply(gsub("(\\D+)\\s.*", "\\1", levels(Sample)), function(x, y) 
      cat.df[match(x, y), 2], cat.df[, 1])
    

    That is all that is needed to relevel the Sample vector: order the levels by prefix + suffix.

    new.Sample <- factor(Sample, levels=levels(Sample)[order(prefix + suffix)])
    new.Sample
    #  [1] Beginner 1         intermediate 8     intermediate 7 & 8 Expert 2          
    #  [5] Expert 10          Beginner 3 & 4     Beginner 5         Beginner 10       
    #  [9] intermediate 1     Expert 1           <NA>              
    # 10 Levels: Beginner 1 Beginner 3 & 4 Beginner 5 Beginner 10 ... Expert 10
    

    Check

    data.frame(sort(new.Sample), as.numeric(sort(new.Sample)))
    #      sort.new.Sample. as.numeric.sort.new.Sample..
    # 1          Beginner 1                            1
    # 2      Beginner 3 & 4                            2
    # 3          Beginner 5                            3
    # 4         Beginner 10                            4
    # 5      intermediate 1                            5
    # 6  intermediate 7 & 8                            6
    # 7      intermediate 8                            7
    # 8            Expert 1                            8
    # 9            Expert 2                            9
    # 10          Expert 10                           10
    

    Conversion to numeric

    as.numeric(new.Sample)
    # [1]  1  7  6  9 10  2  3  4  5  8 NA
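
    Note that as.numeric() on a factor returns the rank among the levels that happen to be present, so the codes shift when other levels appear in the data. If a fixed position on the full scale is wanted instead, one possible sketch follows. It assumes three tiers of ten steps each, as in the question, and numbers them 1 to 30 without the extra "NA" level; the names tiers, labels, step and position are illustrative only.

    ```r
    # Assumed scale: three tiers of ten steps each
    tiers  <- c("Beginner", "intermediate", "Expert")
    labels <- c("Beginner 1", "Beginner 3 & 4", "intermediate 7",
                "intermediate 7 & 8", "intermediate 8", "Expert 10")

    # Mean of all numbers found in each label, e.g. "7 & 8" gives 7.5
    step <- vapply(regmatches(labels, gregexpr("[0-9]+", labels)),
                   function(d) mean(as.numeric(d)), numeric(1))

    # Tier index: the tier name is everything before the first space
    tier <- match(sub(" .*", "", labels), tiers)

    # Position on the 1..30 scale; combined labels land halfway between steps
    position <- (tier - 1) * 10 + step
    setNames(position, labels)
    # "intermediate 7 & 8" gets 17.5, between 17 and 18
    ```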
    

    Data

    Sample <- structure(c(1L, 10L, 9L, 7L, 6L, 3L, 4L, 2L, 8L, 5L, NA), .Label = c("Beginner 1", 
    "Beginner 10", "Beginner 3 & 4", "Beginner 5", "Expert 1", "Expert 10", 
    "Expert 2", "intermediate 1", "intermediate 7 & 8", "intermediate 8"
    ), class = "factor")
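
    Since more combined labels such as "Beginner 3 & 4" may turn up later, the steps above can be wrapped in a small function. This is only a sketch: relevel_combined and its tiers argument are hypothetical names, and it sorts by tier and mean number directly with order() rather than adding a prefix and a suffix, which should give the same ordering.

    ```r
    # Hypothetical helper (not part of the answer above): reorder the levels
    # of x by tier, then by the mean of the numbers found in each label.
    relevel_combined <- function(x, tiers = c("Beginner", "intermediate", "Expert")) {
      lev <- levels(x)
      # Mean of all numbers in a label: "intermediate 7 & 8" gives 7.5
      num <- vapply(regmatches(lev, gregexpr("[0-9]+", lev)),
                    function(d) mean(as.numeric(d)), numeric(1))
      # Tier name is everything before the first space
      tier <- match(sub(" .*", "", lev), tiers)
      # Fail loudly on labels with an unknown tier or without any number
      if (anyNA(tier) || anyNA(num)) stop("unrecognised label in levels(x)")
      factor(x, levels = lev[order(tier, num)])
    }

    Sample <- factor(c("Beginner 1", "intermediate 8", "intermediate 7 & 8",
                       "Expert 2", "Expert 10", "Beginner 3 & 4", "Beginner 5",
                       "Beginner 10", "intermediate 1", "Expert 1", NA))
    new.Sample <- relevel_combined(Sample)
    levels(new.Sample)
    as.numeric(new.Sample)
    ```

    New combined labels are then handled without maintaining a fixed list of levels, as long as they start with one of the known tier names.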