r similarity vegan centroid multi-dimensional-scaling

R extract distance between centroids to data frame using Vegan


I have a biological data set in which each centroid represents a given year, and I want to calculate the distance between the centroids of consecutive years (2017 to 2018, 2018 to 2019, and so on). I'm exploring usedist::dist_between_centroids() to calculate these distances in high-dimensional space, but it is arduous to use because the function requires the indices of the rows in each group (here, each year) as vector inputs. I've also looked at vegan::adonis() as an alternative, but I can't figure out how to extract the distances from it. Below is some sample data based on dune, with one of the factors (Use) recoded as 'year'. My actual data set covers ~20 years, so calculating each distance by hand, as I've done below, is not practical. I think a loop over dist_between_centroids() might accomplish this, but I'm not sure how to specify the grouping vectors inside the loop.


# Species and environmental data
require(vegan)
require(usedist)
require(dplyr)  # provides %>%, arrange() and recode_factor()

dune <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)

dune.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)

# Note: these load vegan's built-in dune data and overwrite the objects read above
data(dune)
data(dune.env)

all_data <- cbind(dune.env, dune) %>%
              arrange(Use)

all_data$Use <- recode_factor(all_data$Use, "Hayfield"="2017")
all_data$Use <- recode_factor(all_data$Use, "Haypastu"="2018")
all_data$Use <- recode_factor(all_data$Use, "Pasture"="2019")


bio_data <- all_data[,6:35] 

bio_distmat <- vegdist(bio_data, method = "bray", na.rm=T) 


# Store the distances in a data frame
dist_between_mat <- as.data.frame(matrix(ncol=3, nrow=2))
colnames(dist_between_mat) <- c("start_centroid","end_centroid","distance")

dist_between_mat[1,1] = "2017"
dist_between_mat[1,2] = "2018"
dist_between_mat[1,3] = dist_between_centroids(bio_distmat, 1:7,8:15) #distance between 2017 and 2018

dist_between_mat[2,1] = "2018"
dist_between_mat[2,2] = "2019"
dist_between_mat[2,3] = dist_between_centroids(bio_distmat, 8:15,16:20) #distance between 2018 and 2019



Solution

  • You can do this with a simple for-loop. But why write simple code when we can use "tidy" principles instead?

    Here is a solution that iterates over the start years and the end years, generates a one-row tibble and then concatenates the rows into a final tibble.

    Note that in your reproducible example the years/levels are in reverse chronological order. I use the levels ordering, without casting the levels to years, so make sure that this is the order you intend.

    library(purrr)   # map2_dfr()
    library(tibble)  # tibble()
    
    levels(all_data$Use)
    #> [1] "2019" "2018" "2017"
    
    n <- nlevels(all_data$Use)
    
    start <- levels(all_data$Use)[1:(n - 1)]
    start
    #> [1] "2019" "2018"
    end <- levels(all_data$Use)[2:n]
    end
    #> [1] "2018" "2017"
    
    map2_dfr(start, end, ~ {
      idx1 <- which(all_data$Use == .x)
      idx2 <- which(all_data$Use == .y)
      tibble(
        start_centroid = .x,
        end_centroid = .y,
        distance = dist_between_centroids(bio_distmat, idx1, idx2)
      )
    })
    #> # A tibble: 2 × 3
    #>   start_centroid end_centroid distance
    #>   <chr>          <chr>           <dbl>
    #> 1 2019           2018            0.210
    #> 2 2018           2017            0.327
    

    Created on 2022-07-27 by the reprex package (v2.0.1)
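
For comparison, here is a minimal sketch of the plain for-loop mentioned at the top of this answer. The helper name `sequential_centroid_distances` and its `centroid_dist` argument are hypothetical, introduced only for illustration. The sketch assumes the grouping column holds years that convert cleanly to integers, and it sorts those years numerically, so the pairs come out in chronological order regardless of the factor level order.

```r
# Sketch only: distance between the centroids of consecutive years.
# `centroid_dist` defaults to usedist::dist_between_centroids(); it is exposed
# as an argument (an assumption of this sketch) so another function can be
# swapped in.
sequential_centroid_distances <- function(d, year,
                                          centroid_dist = usedist::dist_between_centroids) {
  # Convert factor or character years to integers, one value per row of the data
  year_int <- as.integer(as.character(year))

  # Distinct years in chronological order, independent of factor level order
  years <- sort(unique(year_int))
  stopifnot(length(years) >= 2)

  # One row per consecutive pair of years
  out <- data.frame(
    start_centroid = head(years, -1),
    end_centroid   = tail(years, -1),
    distance       = NA_real_
  )

  for (i in seq_len(nrow(out))) {
    # Row indices of the two groups: these are the "grouping vectors"
    idx1 <- which(year_int == out$start_centroid[i])
    idx2 <- which(year_int == out$end_centroid[i])
    out$distance[i] <- centroid_dist(d, idx1, idx2)
  }
  out
}

# Intended use with the objects from the question (needs vegan and usedist):
# sequential_centroid_distances(bio_distmat, all_data$Use)
```

Because the row indices are looked up with which(), the rows do not need to be sorted by year first. If you would rather keep the map2_dfr() version, reordering the levels beforehand with factor(all_data$Use, levels = sort(levels(all_data$Use))) should give the same chronological pairing.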