Search code examples
rrscript

How to combine duplicate rows in R?


I'm using R to create a data frame with several duplicate columns. I'd like to consolidate all of the duplicated columns into a single column. How to do this in R?

Note 1 : When I build dataframes with multiple columns, R adds numbers to the names of the repeated columns by default.

Note 2: I am looking for code that will work with the columns regardless of their order.

code:

# Create the data frame.
emp.data <- data.frame(
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25), 
  salary = c(700.3,600.2,721.0,730.5,845.4), 
  emp_name = c("Kevin","Tracy","Thompson","Peter","Bevan"),
  stringsAsFactors = FALSE
)
# Print the data frame.         
print(emp.data)

current result

         emp_name   salary   salary.1 emp_name.1
         Rick       623.3    700.3    Kevin
         Dan        515.20   600.2    Tracey
         Michelle   611.00   721.0    Thompson
         Ryan       729.00   730.5    Peter
         Gary       843.25   845.4    Bevan

Expected output

       emp_name   salary   
         Rick       623.3    
         Dan        515.20   
         Michelle   611.00   
         Ryan       729.00   
         Gary       843.25   
         Kevin      700.3
         Tracey     600.2
         Thompson   721.0
         Peter      730.5
         Bevan      845.4

Solution

  • Use melt from data.table

    library(data.table)
    melt(setDT(emp.data), measure = patterns("^emp_name", "salary"),
         value.name = c("emp_name", "salary"))[, variable := NULL][]
        emp_name salary
     1:     Rick 623.30
     2:      Dan 515.20
     3: Michelle 611.00
     4:     Ryan 729.00
     5:     Gary 843.25
     6:    Kevin 700.30
     7:    Tracy 600.20
     8: Thompson 721.00
     9:    Peter 730.50
    10:    Bevan 845.40
    

    data

    emp.data <- structure(list(emp_name = c("Rick", "Dan", "Michelle", "Ryan", 
    "Gary"), salary = c(623.3, 515.2, 611, 729, 843.25), salary = c(700.3, 
    600.2, 721, 730.5, 845.4), emp_name = c("Kevin", "Tracy", "Thompson", 
    "Peter", "Bevan")), class = "data.frame", row.names = c(NA, -5L
    ))