Search code examples
rsortingdplyrduplicates

Reorder values across rows and convert duplicates to NA in R


I am trying to reorder values rowwise in a dataframe in R.

data <- data.frame(sample = c("A", "B", "C", "D", "E", "F"),
                   year_1 = c(2015, 2014, NA, 1985, 2011, 2010),
                   year_2 = c(NA, 1986, 1999, 1986, 2009, 2009),
                   year_3 = c(2015, 2014, 2014, 1956, NA, 2010),
                   year_4 = c(NA, 2014, 2014, 1985, 2010, 2010))

data

>    sample year_1 year_2 year_3 year_4
>  1      A   2015     NA   2015     NA
>  2      B   2014   1986   2014   2014
>  3      C     NA   1999   2014   2014
>  4      D   1985   1986   1956   1985
>  5      E   2011   2009     NA   2010
>  6      F   2010   2009   2010   2010

I would like to reorder each row so that year_1 has the highest year, year_2 has the next highest, and so forth. I would also like to keep all columns, so any NAs would appear starting with year_4.

I have managed to reorder the values in each row as described above.

cols <- c("year_1", "year_2", "year_3", "year_4")
data[cols] <- t(apply(data[cols], 
                      MARGIN = 1,
                      FUN = function(x) sort(x, decreasing = TRUE, na.last = TRUE)))

data

>    sample year_1 year_2 year_3 year_4
>  1      A   2015   2015     NA     NA
>  2      B   2014   2014   2014   1986
>  3      C   2014   2014   1999     NA
>  4      D   1986   1985   1985   1956
>  5      E   2011   2010   2009     NA
>  6      F   2010   2010   2010   2009

HOWEVER, I would also like to remove any duplicates and replace with NA (to appear at the end of the row). For example, for Sample D, the row should be 1986, 1985, 1956, NA.

END GOAL:

>    sample year_1 year_2 year_3 year_4
>  1      A   2015     NA     NA     NA
>  2      B   2014   1986     NA     NA
>  3      C   2014   1999     NA     NA
>  4      D   1986   1985   1956     NA
>  5      E   2011   2010   2009     NA
>  6      F   2010   2009     NA     NA

Note: my data have several hundred thousand rows, so efficient code is much appreciated.


Solution

  • You were close! One option would be to just replace the duplicated values with NA before sorting:

    cols <- c("year_1", "year_2", "year_3", "year_4")
    
    data[cols] <- t(apply(data[cols], 1, \(x){
      x[duplicated(x)] <- NA
      sort(x, decreasing = TRUE, na.last = TRUE)
    }))  
    

    Output:

      sample year_1 year_2 year_3 year_4
    1      A   2015     NA     NA     NA
    2      B   2014   1986     NA     NA
    3      C   2014   1999     NA     NA
    4      D   1986   1985   1956     NA
    5      E   2011   2010   2009     NA
    6      F   2010   2009     NA     NA