r, dataframe, tidyverse

Order columns based on a suffix condition in R


The names of my variables look like this:

df <- data.frame(var_NA = 1:10, var = 11:20, var_Level = 21:30, var_Total = 31:40)

Except I have lots of variables. The key feature is that for every "mother" variable var, there are many "child" variables with different names (like var_NA and var_Level). Some "mothers" have more "children" than others. One thing is fixed though: there is always a child with the suffix _NA.

What I want is to order columns like this:

  1. mother variable
  2. _NA child
  3. the other children (if available)
  4. mother variable
  5. _NA child
  6. the other children (if available) . . .

In my example, the outcome would be var, var_NA, var_Level, var_Total.

I've given up trying with select(ends_with()), relocate() and other commands. This is probably best done with regex, of which I am totally ignorant. Any ideas?


Solution

  • Have updated answer to correspond to the changes in the question.

    Create nms as the names of df, except that the name ending in _NA is replaced with the same name ending in just _ so that it sorts earlier. The $ in _NA$ anchors the pattern to the end of the string, so _NA$ only matches a name ending in _NA.

    Now the sorted order of nms applied to the columns of df sorts the columns of df as desired.

    nms <- sub("_NA$", "_", names(df))
    df[order(nms)]
    

    giving (continued after output):

       var var_NA var_Level var_Total
    1   11      1        21        31
    2   12      2        22        32
    3   13      3        23        33
    4   14      4        24        34
    5   15      5        25        35
    6   16      6        26        36
    7   17      7        27        37
    8   18      8        28        38
    9   19      9        29        39
    10  20     10        30        40
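
    With several mothers whose names are prefixes of one another (say var and var2), a single sort key can misplace columns: in the C locale, for instance, digits sort before the underscore, so var2 would land between var and var_NA. One possible alternative is to sort on explicit keys. This is only a sketch; df2, mothers, mother and rank are hypothetical names, and it assumes child suffixes contain no underscore and that every mother has an _NA child, as the question states.

    ```r
    # Hypothetical data: two mothers, "var" and "var2", with shuffled columns.
    df2 <- data.frame(var2_Total = 1:3, var_NA = 4:6, var2 = 7:9,
                      var_Level = 10:12, var = 13:15, var2_NA = 16:18)

    cols <- names(df2)

    # Every mother has an _NA child, so mothers can be recovered from those names.
    mothers <- sub("_NA$", "", grep("_NA$", cols, value = TRUE))

    # Group key: a mother maps to itself, a child to its name minus the last "_suffix".
    mother <- ifelse(cols %in% mothers, cols, sub("_[^_]*$", "", cols))

    # Rank within a group: 0 = mother, 1 = _NA child, 2 = any other child.
    rank <- ifelse(cols %in% mothers, 0L, ifelse(grepl("_NA$", cols), 1L, 2L))

    # method = "radix" compares strings in the C locale regardless of LC_COLLATE.
    sorted <- df2[order(mother, rank, cols, method = "radix")]
    names(sorted)
    ## [1] "var"        "var_NA"     "var_Level"  "var2"       "var2_NA"    "var2_Total"
    ```

    The other children are ordered by name within each group; a different tie-breaker could be swapped in as the third key.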
    

    Collating sequence

    Note that the actual sort order depends on the LC_COLLATE setting of the locale. In the examples below, digits sort before letters in both the English and C locales; in the C locale, however, all upper-case letters come before all lower-case letters, whereas the English locale interleaves them. With the solution above, the var column comes first and the var_NA column second (it corresponds to var_ in nms) in both locales, but the order of the remaining names is locale dependent.

    Sys.getlocale() # shows locale being used including LC_COLLATE
    ## ..snip..
    
    x <- c("0", "1", "2", "a", "b", "c", "A", "B", "C")
    
    Sys.setlocale("LC_COLLATE", "en_US.utf8") # locale names are platform specific
    sort(x)
    ## [1] "0" "1" "2" "a" "A" "b" "B" "c" "C"
    
    Sys.setlocale("LC_COLLATE", "C")
    sort(x)
    ## [1] "0" "1" "2" "A" "B" "C" "a" "b" "c"
    
    Sys.setlocale("LC_COLLATE", "") # set locale back to default
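
    If the column order should not depend on the session locale at all, one option is to pass method = "radix" to order(), which is documented to sort character vectors in the C locale. The snippet below is a minimal sketch that applies this to the example data from the question; res is just an illustrative name.

    ```r
    df <- data.frame(var_NA = 1:10, var = 11:20, var_Level = 21:30, var_Total = 31:40)

    # Same trick as above: make the _NA child sort right after its mother.
    nms <- sub("_NA$", "_", names(df))

    # Radix ordering ignores LC_COLLATE and compares in the C locale.
    res <- df[order(nms, method = "radix")]
    names(res)
    ## [1] "var"       "var_NA"    "var_Level" "var_Total"
    ```

    In the C locale upper-case letters precede lower-case ones, so children such as var_Level would sort before a hypothetical var_level.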