Tags: r, apache-spark, apache-spark-sql, sparkr

Removing duplicate observations in SparkR DataFrame


I have a SparkR DataFrame that contains duplicate observations. I can't find a simple way to drop them, and the PySpark dropDuplicates() function seems to be unavailable in SparkR. For example, given the following DataFrame, how can I drop the 2nd and 4th rows, where fullname is duplicated?

library(SparkR)
sparkR.session()  # a SparkR session must be running before createDataFrame

# build a local data.frame, then convert it to a SparkDataFrame
newHires <- createDataFrame(data.frame(name = c("Thomas", "Thomas", "Bill", "Bill"),
                                       surname = c("Smith", "Smith", "Taylor", "Taylor"),
                                       value = c(1.5, 1.5, 3.2, 3.2)))
newHires <- withColumn(newHires, "fullname", concat(newHires$name, newHires$surname))
|name    | surname | value | fullname  |
|--------|---------|-------|-----------|
|Thomas  | Smith   |  1.5  |ThomasSmith|
|Thomas  | Smith   |  1.5  |ThomasSmith|
|Bill    | Taylor  |  3.2  |BillTaylor |
|Bill    | Taylor  |  3.2  |BillTaylor |

Solution

  • SparkR provides a dropDuplicates() function as well; pass it the SparkDataFrame and the column(s) to deduplicate on:

    dropDuplicates(newHires, "fullname")
    

    See the SparkR API documentation for dropDuplicates for details.

    Hope this helped!
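As a side note, the helper fullname column isn't strictly required: dropDuplicates() also accepts the column names to deduplicate on directly, and distinct() removes rows that are duplicated across every column. A minimal sketch against the newHires SparkDataFrame built above (assuming Spark 2.1+; on Spark 2.0, pass a character vector instead, i.e. dropDuplicates(newHires, c("name", "surname"))):

    # deduplicate on the name/surname pair directly, no helper column needed
    deduped <- dropDuplicates(newHires, "name", "surname")
    head(deduped)  # 2 rows remain: Thomas/Smith/1.5 and Bill/Taylor/3.2

    # alternatively, when entire rows are duplicated, distinct() drops exact copies
    head(distinct(newHires))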