
Why does stringr::str_order(x, numeric = T) sort data differently in conjunction with dplyr::arrange than with hard brackets?


I am trying to arrange a data.frame by a text column with some numeric values in it:

foo <- data.frame(x = c("A100", "A1", "A2", "A10", "A11"))

I want to sort it in natural (numeric) order using stringr::str_order(foo$x, numeric = TRUE) or something similar. When I use this with dplyr::arrange, the rows are not arranged correctly. Here is what I tried:

dplyr::arrange(foo, stringr::str_order(x,numeric = T))

On my machine, this returns the values in the order of A11, A100, A1, A2, A10, as opposed to A1, A2, A10, A11, A100. This code works correctly:

foo[stringr::str_order(foo$x,numeric = T),]

I would expect these to do the same thing, but they don't, at least on my machine (Windows 10, R version 4.1.0) and my brother's (Mac, R version 4.0.2).

My question is, why is the output different? What am I missing? Is there a way to make str_order and arrange work together?

I would like to be able to sort this column using dplyr::arrange so that I do not need to track down all of the places that I used arrange.

Thank you for your thoughts and time!


Solution

  • Note that str_order, just like order, returns the indices that would sort the vector in ascending order: the i-th entry of the result is the current position of the i-th smallest element. For example:

    str_order(foo$x,numeric = T)
    [1] 2 3 4 5 1
    

    This means the smallest element is currently in position 2 of the vector, while the largest element, which belongs last, is currently in position 1.

    arrange, on the other hand, sorts the rows by the values of the expression it is given, so it needs the position each element should occupy once ordered, i.e. the ranks (with no ties).

    y <- c(100,1,2,10,11)
    order(y)
    [1] 2 3 4 5 1 # We do not want this
    rank(y)
    [1] 5 1 2 3 4 # We want this.
    

    Note that rank assigns 1 to the smallest value, which sits in position 2, and 5 to the largest value, which sits in position 1.

    To obtain the ranks, apply order to the result of str_order; when there are no ties, order(order(x)) equals rank(x). Hence:

    arrange(foo, order(str_order(x,numeric = T)))
         x
    1   A1
    2   A2
    3  A10
    4  A11
    5 A100
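
The difference between the two calls can also be reproduced in base R alone. The sketch below is only an illustration, not the internals of dplyr or stringr: it assumes every label follows the "A<number>" pattern, extracts the number with sub(), and stands in for arrange(df, key) with the equivalent x[order(key)]. The variable names (num, ord, rnk, by_index, by_key_wrong, by_key_right) are made up for this example.

```r
# Base R only; `x` mirrors the column from the question.
x <- c("A100", "A1", "A2", "A10", "A11")

# Numeric part of each label (assumes the "A<number>" pattern holds).
num <- as.integer(sub("^A", "", x))

ord <- order(num)   # positions to pick, in sorted order: 2 3 4 5 1
rnk <- rank(num)    # sort key for each element:          5 1 2 3 4

# Square-bracket indexing wants positions, so `ord` sorts correctly.
by_index <- x[ord]

# Sorting *by* `ord` as if it were a key (which is how arrange() treats
# its argument) reproduces the scrambled order from the question.
by_key_wrong <- x[order(ord)]

# Ordering the order vector yields the ranks (no ties here), and sorting
# by that key gives the same result as indexing with `ord`.
by_key_right <- x[order(order(ord))]
```

Here by_key_wrong comes out as A11, A100, A1, A2, A10, matching the unexpected arrange output, while by_index and by_key_right both give A1, A2, A10, A11, A100.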