r, factors, levels, forcats

Why is levels() assigning the wrong level to my data?


I'm creating a function that requires users to upload a dataset containing a character vector with a fixed set of values. Under the hood, I need one column that keeps the vector as character, and a separate column that is identical except that it is a factor with specific levels.

When I try using levels() to assign the levels, I assumed R would match up the strings, but it seems to assign the order of the levels at random. How do I correct this behavior? Though the specific character values will always be the same, I won't know the order in which users will upload them.

#Data to recreate the issue (note: The group and count columns are not relevant, 
# but I kept them in case they may be related to the issue for some reason)

library(dplyr)

data <- tibble(group=factor(c(rep("A", 10), rep("B", 10), rep("C", 10),
                              rep("D", 10)), levels=c("A", "B", "C", "D")),
               state=c(rep(c("Not Started", "Just Beginning",
                               "25% Complete", "40% Complete", "Halfway Done",
                               "75% Complete", "Mostly Done", "Completed",
                               "Follow Up", "Final Follow Up"), 4)),
               count=c(100, 5, 4, 445, 67, 44, 25, 877, 240, 353,
                         48, 51, 48, 40, 141, 34, 50, 45, 34, 35,
                         140, 5, 8, 0, 17, 42, 0, 5, 3, 75,
                         477, 20, 59, 13, 1065, 1, 50, 353, 73, 104))

data$state_factor <- as.factor(data$state)

levels(data$state_factor) <- c("Not Started", "Just Beginning",
                               "25% Complete", "40% Complete", "Halfway Done",
                               "75% Complete", "Mostly Done", "Completed",
                               "Follow Up", "Final Follow Up")

head(data, 20) #Note how the state and state_factor columns are not identical

I'm flexible about how I accomplish this (e.g., is there a function in forcats I'm missing?), but the factor needs to have these levels in this order.


Solution

  • Update:

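    In that case, use factor() instead of as.factor() and pass the levels directly. as.factor() sorts the levels alphabetically, and levels<- then renames those sorted levels by position rather than matching them to the strings, which is why the two columns stop agreeing: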

    data$state_factor <- factor(data$state, levels=c("Not Started", "Just Beginning",
                                                        "25% Complete", "40% Complete", "Halfway Done",
                                                        "75% Complete", "Mostly Done", "Completed",
                                                        "Follow Up", "Final Follow Up"))
    

    Output:

    > head(data, 20)  
    # A tibble: 20 × 4
       group state           count state_factor   
       <fct> <chr>           <dbl> <fct>          
     1 A     Not Started       100 Not Started    
     2 A     Just Beginning      5 Just Beginning 
     3 A     25% Complete        4 25% Complete   
     4 A     40% Complete      445 40% Complete   
     5 A     Halfway Done       67 Halfway Done   
     6 A     75% Complete       44 75% Complete   
     7 A     Mostly Done        25 Mostly Done    
     8 A     Completed         877 Completed      
     9 A     Follow Up         240 Follow Up      
    10 A     Final Follow Up   353 Final Follow Up
    11 B     Not Started        48 Not Started    
    12 B     Just Beginning     51 Just Beginning 
    13 B     25% Complete       48 25% Complete   
    14 B     40% Complete       40 40% Complete   
    15 B     Halfway Done      141 Halfway Done   
    16 B     75% Complete       34 75% Complete   
    17 B     Mostly Done        50 Mostly Done    
    18 B     Completed          45 Completed      
    19 B     Follow Up          34 Follow Up      
    20 B     Final Follow Up    35 Final Follow Up
    

    The levels now follow the specified order instead of alphabetical order:

    > levels(data$state_factor)
     [1] "Not Started"     "Just Beginning"  "25% Complete"    "40% Complete"    "Halfway Done"    "75% Complete"    "Mostly Done"     "Completed"      
     [9] "Follow Up"       "Final Follow Up"
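
To see why the original levels() call scrambled the column, here is a minimal sketch on a three-value toy vector. The names state, wanted, f_wrong, f_right, and unexpected are made up for illustration and are not part of the question's data. It contrasts renaming levels by position with matching them by value, and adds a check that may be useful for uploaded data, since any value not listed in levels becomes NA.

```r
# Toy input in an arbitrary "upload" order (hypothetical values)
state  <- c("Not Started", "Completed", "Halfway Done", "Completed")
wanted <- c("Not Started", "Halfway Done", "Completed")

# as.factor() sorts the unique values, so the levels start out alphabetical:
# "Completed", "Halfway Done", "Not Started"
f_wrong <- as.factor(state)

# levels<- renames level 1, 2, 3 by position; the integer codes are untouched,
# so every row now carries a different label than the original string
levels(f_wrong) <- wanted
print(as.character(f_wrong))

# factor(levels = ) matches by value: labels are kept, only the order changes
f_right <- factor(state, levels = wanted)
print(as.character(f_right))
print(levels(f_right))

# Values absent from `wanted` would silently become NA, so list them up front
unexpected <- setdiff(unique(state), wanted)
print(unexpected)
```

In the toy example, the row that started as "Not Started" ends up labelled "Completed" in f_wrong, while f_right reproduces state exactly. Inside a function that accepts uploads, a non-empty unexpected would be a reasonable place to stop with an error before building the factor.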