Using distinct() function in R


I'm working with a large dataframe containing longitude and latitude coordinates, each in a different column. I would like to remove every duplicated row only if it has the same longitude AND latitude. Will this solve the problem?

distinct(dat, dat$longitude, dat$latitude, .keep_all = TRUE)

This seems to work, but I'm not sure whether I'm also removing rows where only the longitude matches (with a different latitude), or the other way around.


Solution

  • Assuming you mean dplyr::distinct, it's pretty easy to test this with a toy example:

    dat <- data.frame(longitude = c(1, 2, 3, 1, 2, 3),
                      latitude = c(10, 11, 12, 10, 12, 10))
    dat
    #>   longitude latitude
    #> 1         1       10
    #> 2         2       11
    #> 3         3       12
    #> 4         1       10
    #> 5         2       12
    #> 6         3       10
    
    dplyr::distinct(dat, longitude, latitude, .keep_all = TRUE)
    #>   longitude latitude
    #> 1         1       10
    #> 2         2       11
    #> 3         3       12
    #> 4         2       12
    #> 5         3       10
    

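    You can see that it has removed only the one row where both variables were repeated (row 4 of the original). Rows that share just a longitude or just a latitude with another row are kept.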

    Incidentally, you might want to look again at the result of your own code on this dataset:

    distinct(dat, dat$longitude, dat$latitude, .keep_all = TRUE)
    #>   longitude latitude dat$longitude dat$latitude
    #> 1         1       10             1           10
    #> 2         2       11             2           11
    #> 3         3       12             3           12
    #> 4         2       12             2           12
    #> 5         3       10             3           10
    

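    The same rows are kept, but the result gains two extra columns named dat$longitude and dat$latitude. As Akrun pointed out, you don't want to include the dat$ when using tidy evaluation; refer to the columns by their bare names instead.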
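
If you want a cross-check that doesn't depend on dplyr, one option is base R's duplicated() applied to just the two coordinate columns. This is only a sketch using the toy data from the answer; the names is_dup and deduped are made up for illustration.

```r
# Toy data from the answer above
dat <- data.frame(longitude = c(1, 2, 3, 1, 2, 3),
                  latitude  = c(10, 11, 12, 10, 12, 10))

# duplicated() on a data frame flags a row only when the whole row
# (here: both longitude AND latitude) repeats an earlier row
is_dup <- duplicated(dat[, c("longitude", "latitude")])

# Inspect the rows that would be dropped before dropping them
dat[is_dup, ]

# Keep the first occurrence of each coordinate pair, with all columns
deduped <- dat[!is_dup, ]

# Number of rows removed
nrow(dat) - nrow(deduped)
#> [1] 1
```

Looking at dat[is_dup, ] first is a cheap way to confirm on the real dataframe that only full coordinate matches are being removed.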