Search code examples
rdplyrreadr

How to delete duplicated columns in a tibble in the tidyverse


Let's say my data has two identical columns called SYC SJ Equity. When I import this into R using the base read.csv() and setting check.names to FALSE, the data gets imported with two columns having identical names. I can then delete the duplicated columns like so:

df <- df %>% 
  do(.[!duplicated(names(.))])

If I use the read_csv() function from the readr to import the data, the duplicated column gets the column name in the form of "SYC SJ Equity_1". I can then do the following to delete the duplicated column

df <- df %>%
  select(-contains("_"))

However, if the data is in a sheet in a .xlsx format, and I use the read_excel() function from the readxl package, the duplicated column gets imported as:

New names:
* `SYC SJ Equity` -> `SYC SJ Equity...406`
* `SYC SJ Equity` -> `SYC SJ Equity...407`

In this case, both columns get renamed, whereas with read_csv(), only the extra instance(s) gets renamed. In this case, I have to match the entire string to delete the duplicated column. I do not know what numbers will get assigned to the duplicated columns. I assume it's based on the column index? Furthermore, even with the read_csv() import, my solution won't work if my columns all naturally have underscores in their names in the raw data. So my question is, how do I delete a duplicated column in the tidyverse with data stored in tibbles? This is easy with base R, with duplicated(). But this doesn't work with the tibbles of the tidyverse. I understand that dplyr verbs select by column name and not by column index,so the column names always have to be different. But is there a way to always deleted the duplicated column with tidyverse, without looking at the console to see the new duplicated names?


Solution

  • Building off the answer provided by Ronak, if you want to do this in dplyr, then you can just use his provided solution with select_if.

    library(dplyr)
    
    df <- data.frame("x" = runif(3),
                     "SYC SJ Equity...406" = c("a", "a", "b"), 
                     "SYC SJ Equity...407" = c("a", "a", "b"), 
                     "y" = runif(3))
    
    df %>%
      select_if(!duplicated(sub("\\.\\.\\..*", "", names(.))))