
How to assign factor levels by the frequency of strings in a vector?


I want to assign factor levels according to how often each string occurs in a vector. By default, R appears to assign factor levels alphabetically:

set.seed(54)

x <- sample(1:10, 5000, replace = TRUE)
x <- factor(x, labels = LETTERS[1:10])

> summary(x)
  A   B   C   D   E   F   G   H   I   J 
524 508 519 489 477 496 507 526 473 481 

I can reorder the factor levels and reassign them like this:

l <- data.frame(x=summary(x), old.levels=names(summary(x)), 
                        row.names = NULL)

l <- transform(l[order(summary(x)), ],
               new.levels=LETTERS[1:10])

levels(x) <- l[order(l$old.levels), 3]

> summary(x)
  I   G   H   D   B   E   F   J   A   C 
524 508 519 489 477 496 507 526 473 481 

But this only renames the levels; the underlying integer codes are unchanged:

> summary(as.factor(as.numeric(x)))
  1   2   3   4   5   6   7   8   9  10 
524 508 519 489 477 496 507 526 473 481 

Is there a neater way to get what I want?


Solution

  • By default, factor() sorts the unique values to obtain the levels and then assigns the labels to those levels in order, so the two calls below are equivalent:

    set.seed(54)
    
    x <- sample(letters[1:10], 5000, replace = TRUE)
    
    f1 <- factor(x, labels = LETTERS[1:10])
    f2 <- factor(x, levels = sort(unique(x)), labels = LETTERS[1:10])
    
    summary(f1)
    #>   A   B   C   D   E   F   G   H   I   J 
    #> 524 508 519 489 477 496 507 526 473 481
    identical(f1, f2)
    #> [1] TRUE
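
To make the default concrete, here is a minimal sketch on a small hypothetical vector (the names v and f are illustrative and not from the question). It shows that the levels are the sorted unique values and that the stored integer codes are positions in that level set, which is why renaming levels alone leaves the codes untouched.

```r
# Small illustrative vector (hypothetical data)
v <- c("b", "a", "c", "a")
f <- factor(v)

levels(f)        # sorted unique values: "a" "b" "c"
as.integer(f)    # codes are positions in levels(f): 2 1 3 1

# Renaming the levels changes the labels but not the codes
levels(f) <- c("X", "Y", "Z")
as.character(f)  # "Y" "X" "Z" "X"
as.integer(f)    # still 2 1 3 1
```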
    

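    If you just want the labels assigned in frequency order while the levels stay sorted alphabetically, permute the labels when creating the factor. The permutation has to come from rank(), which gives each level's position in the frequency ordering, rather than from order():

    f3 <- factor(x, levels = sort(unique(x)),
                 labels = LETTERS[1:10][rank(table(x), ties.method = "first")])
    summary(f3)
    #>   I   G   H   D   B   E   F   J   A   C 
    #> 524 508 519 489 477 496 507 526 473 481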
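
One caveat when permuting labels this way: order() and rank() are inverse permutations of each other. A small sketch with hypothetical data (v, counts, labs, and g are illustrative names) shows the difference and how rank() maps labels onto alphabetically sorted levels:

```r
v <- c("a", "a", "a", "b", "c", "c")   # a: 3, b: 1, c: 2
counts <- table(v)

order(counts)                          # 2 3 1: indices that would sort the counts
rank(counts, ties.method = "first")    # 3 1 2: each level's position by frequency

# Labels for the alphabetically sorted levels a, b, c
labs <- LETTERS[1:3][rank(counts, ties.method = "first")]
g <- factor(v, levels = sort(unique(v)), labels = labs)
table(g)   # C: 3, A: 1, B: 2, so the least frequent string gets label A
```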
    

    If you want the levels themselves ordered by frequency, so that the labels A to J run from the least to the most frequent string and the integer codes change accordingly, order the levels during factor creation instead:

    f4 <- factor(x, levels = sort(unique(x))[order(table(x))], labels = LETTERS[1:10])
    summary(f4)
    #>   A   B   C   D   E   F   G   H   I   J 
    #> 473 477 481 489 496 507 508 519 524 526
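
As a possible shortcut, the frequency-ordered level set can also be taken from names(sort(table(x))). The sketch below uses a small hypothetical vector (v, lev, and h are illustrative names) and checks that, unlike relabelling, this changes the underlying integer codes:

```r
v <- c("a", "a", "a", "b", "c", "c")   # a: 3, b: 1, c: 2

# Levels from least to most frequent; ties keep alphabetical order
lev <- names(sort(table(v)))           # "b" "c" "a"
h <- factor(v, levels = lev, labels = LETTERS[seq_along(lev)])

summary(h)      # A: 1, B: 2, C: 3
as.integer(h)   # 3 3 3 1 2 2: codes now follow the frequency rank
```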
    

    Created on 2018-03-16 by the reprex package (v0.2.0).