Unique pairwise distances between points in a data frame


I have a list of ten points with X and Y coordinates. I would like to calculate the distance between every unordered pair of points, so that only one of the distances 1-2 and 2-1 is present. I have managed to remove the distance of each point to itself, but I couldn't work out how to drop the duplicate pairs.

# Load dplyr for %>%, mutate, full_join, select and filter
library(dplyr)

# Data Generation
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)

# Temporary key Creation
df <- df %>% mutate(key = 1) 

# Calculating pairwise distances
df %>% full_join(df, by = "key") %>% 
  mutate(dist = sqrt((X.x - X.y)^2 + (Y.x - Y.y)^2)) %>% 
  select(ID.x, ID.y, dist) %>% filter(!dist == 0) %>% head(11)

# Output 
#    ID.x ID.y       dist
# 1     1    2 0.90858911
# 2     1    3 0.71154587
# 3     1    4 0.05687495
# 4     1    5 1.03885510
# 5     1    6 0.93747717
# 6     1    7 0.62070415
# 7     1    8 0.88351690
# 8     1    9 0.89651911
# 9     1   10 0.05079906
# 10    2    1 0.90858911
# 11    2    3 0.27530175

How to achieve the expected output shown below?

# Expected Output 
#    ID.x ID.y       dist
# 1     1    2 0.90858911
# 2     1    3 0.71154587
# 3     1    4 0.05687495
# 4     1    5 1.03885510
# 5     1    6 0.93747717
# 6     1    7 0.62070415
# 7     1    8 0.88351690
# 8     1    9 0.89651911
# 9     1   10 0.05079906
# 10    2    3 0.27530175
# 11    2    4 0.5415415

This approach is also computationally slower than dist(). I would be happy to hear about faster approaches.


Solution

  • I would use dist() on the data and then reshape the output into the required format. You can replace dist() with any other distance function. Here I've used letters rather than numbers as IDs to show more clearly what is happening:

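    library(dplyr)   # %>%, filter
    library(tibble)  # column_to_rownames, rownames_to_column, as_tibble
    library(tidyr)   # gather

    set.seed(42)
    df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = letters[1:10])
    
    df %>% 
      column_to_rownames("ID") %>% # make the ID the row names; dist() will use these. NB: will not work on a tibble
      dist() %>% 
      as.matrix() %>% 
      as.data.frame() %>% 
      rownames_to_column(var = "ID.x") %>% # capture the row IDs
      gather(key = ID.y, value = dist, -ID.x) %>% # reshape to one row per (ID.x, ID.y) pair
      filter(ID.x < ID.y) %>% # keep each unordered pair once and drop self-pairs
      as_tibble()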
    
    # A tibble: 45 x 3
        ID.x  ID.y      dist
       <chr> <chr>     <dbl>
     1     a     b 0.2623175
     2     a     c 0.7891034
     3     b     c 0.6856994
     4     a     d 0.2191960
     5     b     d 0.4757855
     6     c     d 0.8704269
     7     a     e 0.2730984
     8     b     e 0.3913770
     9     c     e 0.5912681
    10     d     e 0.2800021
    # ... with 35 more rows
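
    The ID.x < ID.y filter is the step that removes both the self-pairs and the mirrored pairs, and it can be applied to the cross-join idea from the question as well. Below is a minimal base-R sketch of that, assuming numeric IDs as in the question; merge(..., by = NULL) is swapped in for the key-based full_join to build the Cartesian product, and the names pairs and out are only illustrative.

```r
set.seed(42)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)

# Cartesian product of df with itself; shared column names get .x / .y suffixes
pairs <- merge(df, df, by = NULL)

# Euclidean distance for every ordered pair
pairs$dist <- sqrt((pairs$X.x - pairs$X.y)^2 + (pairs$Y.x - pairs$Y.y)^2)

# Keep each unordered pair once (this also drops a point paired with itself)
out <- pairs[pairs$ID.x < pairs$ID.y, c("ID.x", "ID.y", "dist")]

# Order as 1-2, 1-3, ..., 1-10, 2-3, ... to match the expected output layout
out <- out[order(out$ID.x, out$ID.y), ]
rownames(out) <- NULL

head(out, 11)
```

    Comparing IDs is safer than filtering on dist == 0, because two distinct points with identical coordinates would otherwise be dropped too. This sketch still computes all 100 ordered pairs before filtering, so it does not address the speed concern.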
    

    dist() is very fast compared with looping over the points and calculating each distance. The code could probably be made more efficient by working directly on the dist object rather than converting it into a matrix.
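
    One possible way to work directly on the dist object is sketched below. It relies on the documented layout of a dist object (the lower triangle of the distance matrix stored by columns), which happens to be the same order in which combn() enumerates pairs. The variable names d, pairs and out are illustrative, and this is an untimed sketch rather than a benchmarked replacement.

```r
set.seed(42)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = letters[1:10])

# Distances on the coordinate columns only; d holds n * (n - 1) / 2 values
d <- dist(df[, c("X", "Y")])

# All unordered ID pairs as a 2 x 45 matrix, in the order (a,b), (a,c), ..., (b,c), ...
# as.character() guards against ID being a factor in older R versions
pairs <- combn(as.character(df$ID), 2)

# as.vector() drops the dist attributes and leaves the plain numeric vector
out <- data.frame(ID.x = pairs[1, ],
                  ID.y = pairs[2, ],
                  dist = as.vector(d),
                  stringsAsFactors = FALSE)

head(out)
```

    This avoids building the full 10 x 10 matrix and the reshape step, and the rows come out ordered by ID.x and then ID.y, as in the expected output of the question.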