Search code examples
rdataframedplyr

How to order rows in custom order when there are duplicates?


Can you order the rows of a dataframe when the names are duplicated? I have a dataset that is very large and I cannot rename all the duplicated variables, it looks something like this

stat value
c 3
d 7
a 9
b 5
b 1
c 5
e 8
f 5

And I have to order the rows, alongside with its corresponding value, before exporting as an excel file (ie I need an excel with an exact format).

stat value
a 9
b 1
c 3
d 7
b 5
c 5
e 8
f 5

Edit: Sorry seems like there is some confusion on the order for "stat" in the second example. For the purpose of this question, please assume there is no specific logic for the order of "stat" in the second example, except that it has to be ordered exactly like so.

For context, I need to export the dataframe formatted in an very specific order, there's no particular reason for it except it is company standard procedure.


Data:

xyzzy = structure(list(stat = c("c", "d", "a", "b", "b", "c", "e", "f"
), value = c(3L, 7L, 9L, 5L, 1L, 5L, 8L, 5L)), class = "data.frame", row.names = c(NA, 
-8L))

Solution

  • the ordering of stat is clear to me but I don't understand if there is any certain order of value.

    Here is a solution using data.table:

    library(data.table)
    
    # n samples
    n <- 400
    set.seed(0123)
    
    # create table
    df <- data.table::data.table(
        stat = sample(letters[1:5], size = n, replace = TRUE),
        value = sample(0:100, size = n, replace = TRUE)
    )
    
    # set initial order
    data.table::setorder(df, stat, -value)
    
    # get position by stat
    df[, i := seq_len(.N), by = stat]
    
    # now order again
    data.table::setorder(df, i, stat)
    
    # remove the index
    df[, i := NULL]
    
    > head(df, 10)
          stat value
        <char> <int>
     1:      a   100
     2:      b   100
     3:      c   100
     4:      d   100
     5:      e   100
     6:      a    99
     7:      b   100
     8:      c    99
     9:      d    98
    10:      e    99
    
    > tail(df, 10)
          stat value
        <char> <int>
     1:      a     5
     2:      b     3
     3:      e     2
     4:      a     3
     5:      b     2
     6:      e     1
     7:      a     1
     8:      b     1
     9:      a     0
    10:      b     1
    

    However, if value should NOT be ordered in any specific way. Don't order by it originally

    # set initial order
    data.table::setorder(df, stat, -value)
    

    Should instead be:

    # set initial order
    data.table::setorder(df, stat)
    

    Edit

    If you want to order stat non-alphabetically, this will get you that:

    # n samples
    n <- 400
    set.seed(0123)
    
    # create table
    df <- data.table::data.table(
        stat = sample(letters[1:5], size = n, replace = TRUE),
        value = sample(0:100, size = n, replace = TRUE)
    )
    
    # add custom order
    custom_order <- data.table::data.table(
        stat = sample(letters[1:5]),
        order = 1:5
    )
    
    # merge to table
    df <- data.table::merge.data.table(
        df,
        custom_order,
        by = "stat"
    )
    
    # set initial order
    data.table::setorder(df, order, -value)
    
    # get position by stat
    df[, i := seq_len(.N), by = order]
    
    # now order again
    data.table::setorder(df, i, order)
    
    # remove the index & order
    df[, i := NULL]
    df[, order := NULL]
    

    Where

    > custom_order
         stat order
       <char> <int>
    1:      a     1
    2:      d     2
    3:      c     3
    4:      e     4
    5:      b     5
    
    > head(df)
         stat value
       <char> <int>
    1:      a   100
    2:      d   100
    3:      c   100
    4:      e   100
    5:      b   100
    6:      a    99
    
    > tail(df)
         stat value
       <char> <int>
    1:      e     1
    2:      b     2
    3:      a     1
    4:      b     1
    5:      a     0
    6:      b     1