I have a SparkR DataFrame with duplicate observations. I can't find a simple way to drop duplicates, and it seems that the PySpark dropDuplicates() function is unavailable in SparkR. For example, given the DataFrame below, how can I drop the 2nd and 4th rows, based on the fact that fullname is duplicated?
library(SparkR)
sparkR.session()  # start a Spark session if one is not already running

newHires <- data.frame(name = c("Thomas", "Thomas", "Bill", "Bill"),
                       surname = c("Smith", "Smith", "Taylor", "Taylor"),
                       value = c(1.5, 1.5, 3.2, 3.2))
newHires <- createDataFrame(newHires)  # convert the local data.frame to a SparkDataFrame
newHires <- withColumn(newHires, 'fullname', concat(newHires$name, newHires$surname))
| name   | surname | value | fullname    |
|--------|---------|-------|-------------|
| Thomas | Smith   | 1.5   | ThomasSmith |
| Thomas | Smith   | 1.5   | ThomasSmith |
| Bill   | Taylor  | 3.2   | BillTaylor  |
| Bill   | Taylor  | 3.2   | BillTaylor  |
There is a dropDuplicates function in SparkR as well, which you can use as

dropDuplicates(newHires, "fullname")
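For completeness, here is a minimal sketch of the whole flow, assuming a Spark session is running and newHires is the SparkDataFrame built in the question:

# keep one row per distinct fullname; which of the duplicate
# rows survives is arbitrary in a distributed setting
deduped <- dropDuplicates(newHires, "fullname")
head(collect(deduped))  # pull the result back as a local data.frame

Since the example rows are duplicated across every column, distinct(newHires) would also deduplicate them here, but dropDuplicates lets you restrict the comparison to specific columns like fullname.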
See the SparkR API documentation for dropDuplicates for details.
Hope this helps!