Search code examples
rduplicatesuniquecolumnsorting

How do I remove duplicate values within a column throughout the dataset in R


The dataframe looks like the following:

        var_1 var_2 var_3
        5     2     2
        5     2     1
        6     3     4

dput of data

data <- structure(list(var_1 = c(5L, 5L, 6L), var_2 = c(2L, 2L, 3L), 
  var_3 = c(2L, 1L, 4L)), row.names = c(NA, -3L), class = "data.frame")

How do I remove duplicate values within a column, and repeat this for all columns so the data looks like:

       var_1 var_2 var_3
       5     2     2
       6     3     1
                   4

Solution

  • Here's one approach, pivoting longer, filtering for distinct column-value combinations, then spreading:

    library(tidyverse)
    df %>%
      pivot_longer(cols = everything()) %>%
      distinct(name, value) %>%
      group_by(name) %>%
      mutate(row = row_number()) %>%
      ungroup() %>%
      pivot_wider(names_from = name, values_from = value)
    

    Result

    # A tibble: 3 x 4
        row var_1 var_2 var_3
      <int> <int> <int> <int>
    1     1     5     2     2
    2     2     6     3     1
    3     3    NA    NA     4
    

    If you want the sum of distinct numbers, you could add as a last line:

    %>% summarize(across(-row, ~sum(., na.rm = TRUE)))
    

    Result:

    # A tibble: 1 x 3
      var_1 var_2 var_3
      <int> <int> <int>
    1    11     5     7