r, dataframe, dplyr, data-cleaning

R: remove rows based on the values in multiple columns


Suppose I have a data frame with 100 rows and 100 columns.

For each row, if any two columns have the same value, the row should be removed.

For example, if column 1 and 2 are equal, then this row should be removed.

As another example, if column 10 and column 47 are equal, the row should be removed as well.

Example:

test <- data.frame(x1 = c('a', 'a', 'c', 'd'),
               x2 = c('a', 'x', 'f', 'h'),
               x3 = c('s', 'a', 'f', 'g'),
               x4 = c('a', 'x', 'u', 'a'))

test

  x1 x2 x3 x4
1  a  a  s  a
2  a  x  a  x
3  c  f  f  u
4  d  h  g  a

Only the 4th row should be kept.

How can I do this quickly and concisely, without using for loops?


Solution

  • Use apply to look for duplicates in each row. (Note that this internally converts your data to a matrix for the comparison. If you are doing a lot of row-wise operations I would recommend either keeping it as a matrix or converting it to a long format as in Jack Brookes's answer.)

    # sample data
    set.seed(47)
    dd = data.frame(matrix(sample(1:5000, size = 100^2, replace = TRUE), nrow = 100))
    
    # remove rows with duplicate entries
    result = dd[apply(dd, MARGIN = 1, FUN = function(x) !any(duplicated(x))), ]
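
The same approach can be checked against the small `test` data frame from the question. This is only a sketch: the names `keep` and `result` are chosen here for illustration, and `anyDuplicated(x) == 0` is used as an equivalent of `!any(duplicated(x))`.

```r
# data frame from the question
test <- data.frame(x1 = c('a', 'a', 'c', 'd'),
                   x2 = c('a', 'x', 'f', 'h'),
                   x3 = c('s', 'a', 'f', 'g'),
                   x4 = c('a', 'x', 'u', 'a'))

# TRUE for rows whose values are all distinct;
# anyDuplicated() returns 0 when a vector has no repeated element
keep <- apply(test, MARGIN = 1, FUN = function(x) anyDuplicated(x) == 0)

# keep only the rows without repeated values
result <- test[keep, ]
print(result)
```

Rows 1 to 3 each contain a repeated value, so only row 4 (d, h, g, a) is left, as the question requires.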