Search code examples
rdplyrtibble

Programmatically deal with duplicated columns using name_repair


I am importing a spreadsheet where I have a known vector of what the column headings were originally. When read_excel imports the data, it rightly complains of the duplicated columns and renames them to distinguish them. This is great behaviour. My question is how might I select (from the duplicated columns) the first occurrence of that duplicated column, drop all other duplicated ones and then rename the column back to the original name. I have a working script but it seems clunky. I always struggle to manipulate column headers programmatically within a pipeline.

library(readxl)
library(dplyr, warn.conflicts = FALSE)

cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")

datasets <- readxl_example("datasets.xlsx")

d <- read_excel(datasets, col_names = cols_names, skip = 1)
#> New names:
#> * Sepal.Length -> Sepal.Length...1
#> * Sepal.Length -> Sepal.Length...2
#> * Petal.Length -> Petal.Length...3
#> * Petal.Length -> Petal.Length...4


d_sub <- d %>% 
  select(!which(duplicated(cols_names)))

new_col_names <- gsub("\\.\\.\\..*","", colnames(d_sub))

colnames(d_sub) <- new_col_names

d_sub
#> # A tibble: 150 x 3
#>    Sepal.Length Petal.Length Species
#>           <dbl>        <dbl> <chr>  
#>  1          5.1          1.4 setosa 
#>  2          4.9          1.4 setosa 
#>  3          4.7          1.3 setosa 
#>  4          4.6          1.5 setosa 
#>  5          5            1.4 setosa 
#>  6          5.4          1.7 setosa 
#>  7          4.6          1.4 setosa 
#>  8          5            1.5 setosa 
#>  9          4.4          1.4 setosa 
#> 10          4.9          1.5 setosa 
#> # ... with 140 more rows

Created on 2020-04-08 by the reprex package (v0.3.0)

Any idea how to do this in a more streamlined manner?


Solution

  • Based on @rawr's comment, here is the answer as I see it:

    library(readxl)
    library(dplyr, warn.conflicts = FALSE)
    
    datasets <- readxl_example("datasets.xlsx")
    cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")
    
    d <- read_excel(datasets, col_names = cols_names, skip = 1, .name_repair = make.unique) %>% 
      select(all_of(cols_names))
    #> New names:
    #> * Sepal.Length -> Sepal.Length.1
    #> * Petal.Length -> Petal.Length.1
    

    Created on 2020-04-08 by the reprex package (v0.3.0)