Search code examples
rtidyversedata-cleaningdata-wrangling

How to Rearrange Values in Each Row to Avoid Duplicates Across Columns in R?


Question

I have a data frame in R where each row contains multiple columns with categorical values. My goal is to rearrange the values within each row so that no value is repeated across columns in the same row. The original data frame may contain missing values represented as empty strings or NA, and I want to keep the same number of values per column after the rearrangement.

The main rules are:

  1. For each time (t1, t2, t3, t4), values should appear the same amount of time as possible.
  2. Within the same row, values should not be repeated. (the order doesn't matter)
  3. For t1 and t2, all rows should have a value, for t3 and t4 only a specific % of the rows should have a value. in this case, this example I'm choosing 40%.

Here is an example of my input data:

df <- data.frame(
  t1 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  t2 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  t3 = c("A", "B", "C", "D", "", "", "", "", "", ""),
  t4 = c("A", "B", "C", "D", "", "", "", "", "", "")
)

Expected Output

I want to rearrange each row so there are no duplicate values across columns, it doesn't matter the sequence, for instance, a row could be A, B, C, NA, or A, B, C, D, as far it doesn't have a repeated value, I'm ok. I also need to preserve the number of non-missing values from the original data frame. Here is an example of the desired output:

# Example of expected rearrangement (order may vary):
df_rearranged <- data.frame(
  t1 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  t2 = c("D", "A", "B", "C", "D", "A", "B", "C", "D", "A"),
  t3 = c("B", "", "A", "", "C", "", "D", "", "", ""),
  t4 = c("", "", "", "A", "", "C", "", "", "", "D")
)

Explanation

For context, each column indicates one time many coders (values in time columns) will rate an item (RID). Each new time, I need the coder to rate a different item. The coders will code all items for the first two times (time1 and time2); however, for time3 and time4, coders will code only 25% of the items (in the example above, I used 40% for simplicity, but this % will vary so I need syntax to adjust that automatically). Any help is appreciated; I'm stuck here.

REAL DATA

input

structure(list(RID = c(2L, 9L, 14L, 24L, 44L, 64L, 95L, 116L, 
165L, 169L, 170L, 171L, 172L, 177L, 192L, 215L, 217L, 226L, 246L, 
247L, 288L, 292L, 300L, 306L, 313L, 316L, 339L, 344L, 352L, 355L, 
375L, 378L, 384L, 421L, 476L, 488L, 493L, 495L, 498L, 503L, 532L, 
553L, 581L, 588L, 604L, 605L, 608L, 639L, 640L, 642L, 664L, 669L, 
702L, 742L, 744L, 746L, 749L, 756L, 767L, 820L, 822L, 824L, 825L, 
826L, 842L, 843L, 856L, 865L, 895L, 901L, 916L, 920L, 921L, 929L, 
930L, 934L, 936L, 939L, 952L, 958L), time1 = c("MV", "MV", "AF", 
"RP", "MV", "AC", "RP", "MV", "FL", "MV", "AF", "AF", "AF", "RP", 
"MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", "AF", "FL", 
"AC", "AF", "RP", "FL", "AF", "AC", "AL", "FL", "AL", "FL", "AF", 
"RP", "AC", "RP", "RP", "FL", "AL", "FL", "FL", "RP", "MV", "MV", 
"AC", "MV", "AL", "AL", "RP", "AC", "AF", "AC", "MV", "AL", "AL", 
"RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", "FL", "AC", 
"AL", "MV", "AC", "AF", "AF", "AL", "AL", "FL", "AC", "FL", "AF"
), time2 = c("RP", NA, NA, NA, NA, NA, "AL", "RP", NA, NA, NA, 
"FL", NA, NA, NA, NA, NA, NA, NA, "MV", NA, NA, NA, "FL", "AL", 
"MV", NA, "AL", NA, NA, NA, NA, NA, NA, "RP", NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, "AF", NA, NA, NA, "AF", "AC", NA, NA, 
"AC", NA, NA, NA, NA, NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", 
NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", NA, "FL"), time3 = c("MV", 
"MV", "AF", "RP", "MV", "AC", "RP", "MV", "FL", "MV", "AF", "AF", 
"AF", "RP", "MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", 
"AF", "FL", "AC", "AF", "RP", "FL", "AF", "AC", "AL", "FL", "AL", 
"FL", "AF", "RP", "AC", "RP", "RP", "FL", "AL", "FL", "FL", "RP", 
"MV", "MV", "AC", "MV", "AL", "AL", "RP", "AC", "AF", "AC", "MV", 
"AL", "AL", "RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", 
"FL", "AC", "AL", "MV", "AC", "AF", "AF", "AL", "AL", "FL", "AC", 
"FL", "AF"), time4 = c("RP", NA, NA, NA, NA, NA, "AL", "RP", 
NA, NA, NA, "FL", NA, NA, NA, NA, NA, NA, NA, "MV", NA, NA, NA, 
"FL", "AL", "MV", NA, "AL", NA, NA, NA, NA, NA, NA, "RP", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "AF", NA, NA, NA, "AF", "AC", 
NA, NA, "AC", NA, NA, NA, NA, NA, NA, NA, NA, "AC", NA, NA, NA, 
NA, "MV", NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", NA, "FL"
)), row.names = c(NA, -80L), class = c("tbl_df", "tbl", "data.frame"
))

Expected Output

structure(list(RID = c(2, 9, 14, 24, 44, 64, 95, 116, 165, 169, 
170, 171, 172, 177, 192, 215, 217, 226, 246, 247, 288, 292, 300, 
306, 313, 316, 339, 344, 352, 355, 375, 378, 384, 421, 476, 488, 
493, 495, 498, 503, 532, 553, 581, 588, 604, 605, 608, 639, 640, 
642, 664, 669, 702, 742, 744, 746, 749, 756, 767, 820, 822, 824, 
825, 826, 842, 843, 856, 865, 895, 901, 916, 920, 921, 929, 930, 
934, 936, 939, 952, 958), time1 = c("MV", "MV", "AF", "RP", "MV", 
"AC", "RP", "MV", "FL", "MV", "AF", "AF", "AF", "RP", "MV", "AF", 
"RP", "RP", "MV", "AC", "AC", "FL", "MV", "AF", "FL", "AC", "AF", 
"RP", "FL", "AF", "AC", "AL", "FL", "AL", "FL", "AF", "RP", "AC", 
"RP", "RP", "FL", "AL", "FL", "FL", "RP", "MV", "MV", "AC", "MV", 
"AL", "AL", "RP", "AC", "AF", "AC", "MV", "AL", "AL", "RP", "AL", 
"FL", "MV", "RP", "AL", "AL", "AC", "RP", "FL", "AC", "AL", "MV", 
"AC", "AF", "AF", "AL", "AL", "FL", "AC", "FL", "AF"), time2 = c("RP", 
NA, NA, NA, NA, NA, "AL", "RP", NA, NA, NA, "FL", NA, NA, NA, 
NA, NA, NA, NA, "MV", NA, NA, NA, NA, "AL", "MV", NA, "AL", NA, 
"FL", NA, NA, NA, NA, "RP", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, "AF", NA, NA, NA, "AF", "AC", NA, NA, "AC", NA, NA, NA, NA, 
NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", NA, "MV", NA, "AF", 
NA, NA, NA, NA, NA, "RP", NA, "FL"), time3 = c("FL", "AF", "MV", 
"MV", "AF", "RP", "MV", "AC", "RP", "AL", "FL", "MV", "FL", "AF", 
"AF", "RP", "MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", 
"AF", "FL", "AC", "AF", "RP", "FL", "AF", "AC", "MV", "AF", "AL", 
"FL", "AF", "FL", "AC", "RP", "RP", "AL", "AL", "FL", "FL", "RP", 
"MV", "RP", "AC", "MV", "AL", "AL", "RP", "FL", "AF", "AC", "MV", 
"AL", "MV", "RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", 
"FL", "AC", "AL", "MV", "AC", "AF", "AF", "AL", "AL", "AC", "AC"
), time4 = c(NA, NA, NA, NA, "AL", NA, NA, NA, NA, NA, "RP", 
NA, NA, NA, "FL", NA, NA, "MV", NA, NA, NA, NA, "FL", NA, NA, 
"AL", "MV", NA, "AL", NA, NA, NA, NA, "RP", NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, "AF", NA, NA, "AF", NA, NA, NA, "AC", NA, 
NA, NA, "AC", NA, NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", 
NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", "FL", "RP", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -80L))

Solution

  • This seems like a Multidimensional assignment problem.

    In your case, you would like to arrange the values in columns t2, t3, and t4 such that there are no duplicates across rows. The Hungarian Algorithm Solver (HungarianSolver) from the RcppHungarian package seems to be a good choice here. The function "solves weighted bipartite matching problems (e.g., optimal matching of people to cars or optimal matching of students to colleges, etc...)"

    Use of the function is straightforward. It needs just one argument, a cost matrix. I think for this situation, we simply need 0s and 1s where a 0 indicates no cost, i.e. any pairing is allowed, and a 1 indicates some non-zero cost where that particular pairing is not desired. For this, I use outer with == as the FUN (excluding NA). The result is a cost matrix which can be used as input to the solver, giving a vector of pairings, the second column being the desired indices that minimizes the cost.

    library(RcppHungarian)
    

    Since you provided a toy dataset and a real one, I'll wrap my code in a function so that I can call it twice. The only argument is the data.

    fn <- function(data) {
      # Helper function for the `outer` function.
      equal <- function(x, y) (x==y) & !is.na(x) & !is.na(y)
      
      # Extract the four columns
      t1 <- data[,1, drop=TRUE]
      t2 <- data[,2, drop=TRUE]
      t3 <- data[,3, drop=TRUE]
      t4 <- data[,4, drop=TRUE]
      
      # Create the cost matrix for t1 and t2
      cost2 <- outer(t1, t2, FUN=equal)
    
      # Solve the problem for t2 and assign the result
      res2 <- HungarianSolver(cost2)
      t2a <- t2[res2$pairs[,2]]
      
      # Repeat for t3 and t4 (aggregating the costs)
      cost3 <- outer(t1, t3, equal) + outer(t2a, t3, equal)
      res3 <- HungarianSolver(cost3)
      t3a <- t3[res3$pairs[,2]]
    
      cost4 <- outer(t1, t4, equal) + outer(t2a, t4, equal) + outer(t3a, t4, equal)
      res4 <- HungarianSolver(cost4)
      t4a <- t4[res4$pairs[,2]]
      
      return(list(data=data.frame(t1, t2=t2a, t3=t3a, t4=t4a),
                  cost=c(res2$cost, res3$cost, res4$cost)))
    }
    

    Call the above function for the toy data set

    fn(df)
    

    $data
       t1 t2   t3   t4
    1   A  B    C    D
    2   B  A    D    C
    3   C  D    A    B
    4   D  C    B    A
    5   A  B <NA> <NA>
    6   B  A <NA> <NA>
    7   C  D <NA> <NA>
    8   D  C <NA> <NA>
    9   A  B <NA> <NA>
    10  B  A <NA> <NA>
    
    $cost
    [1] 0 0 0
    

    We see that the rows are not duplicated and the cost is 0. Now we try on the real data.

    DF_arranged <- fn(DF[,-1])
    head(DF_arranged$data, 10)
    

       t1 t2   t3   t4
    1  MV AF   RP   FL
    2  MV RP <NA> <NA>
    3  AF MV <NA>   RP
    4  RP MV   FL <NA>
    5  MV AC <NA> <NA>
    6  AC MV <NA> <NA>
    7  RP MV   AL <NA>
    8  MV RP <NA> <NA>
    9  FL MV   RP <NA>
    10 MV FL <NA>   RP
    ...
    

    The first ten rows look good (no duplicates across rows). Further inspection verifies the remaining rows.

    sum(apply(DF_arranged, 1, FUN=\(x) sum(duplicated(x, incomparables=NA))))
    # [1] 0
    
    DF_arranged$cost
    # [1] 0 0 0
    

    There are no duplicates and the cost is also 0.


    Data:

    df is the toy data (provided by OP).

    DF is the real data (also provided by OP except that time2 and time3 were interchanged since it was stated that all rows in time1 and time2 should have a value)