r, grouping, cluster-analysis

How to identify groups of interindividual revisitation?


I have an output data frame from the "recurse" package, which calculates the revisitation rate of several individuals based on GPS points. The data frame has 18 columns, including "site" and "id", and more than 43,000 rows.

I have two questions: (1) which sites are used by multiple individuals, and (2) which individuals share the same site?

I grouped the data frame by site and id to then filter only the sites with more than one connected id:

library(tidyverse)
# one row per site/id pair; n = number of rows for that pair
sites <- tab %>% group_by(site, id) %>% summarise(n = n())
# one row per site; n = number of ids connected to that site
sites2 <- sites %>% group_by(site) %>% summarise(n = n())
# keep only sites with more than one connected id
sites3 <- subset(sites2, n > 1)
# filter the original data frame to only the sites connected to more than one id
filtered <- left_join(sites3, tab, by = "site")
# group again by site and id
filtered2 <- filtered %>% group_by(site, id) %>% summarise(n = n())

I'm not an expert in R so I guess there would've been an easier or cleaner way to do this, but it worked with my R knowledge. With this I know which sites are visited by different individuals. Now I have something that looks like this:
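For reference, the same filtering can probably be written as one pipeline. This is only a sketch: the small `tab` below is a made-up stand-in for the real recurse output (its values are hypothetical), and `shared` is a name chosen here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the recurse output: one row per recorded visit
tab <- data.frame(
  site = c("site 1", "site 1", "site 1", "site 2", "site 2", "site 3"),
  id   = c(152L, 152L, 160L, 164L, 164L, 166L)
)

shared <- tab %>%
  count(site, id) %>%               # one row per site/id pair, n = rows for that pair
  group_by(site) %>%
  filter(n_distinct(id) > 1) %>%    # keep only sites used by more than one id
  ungroup()

print(shared)
```

Only "site 1" is kept here, because it is the only site in the toy data with two different ids.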

# A tibble: 3,041 x 3
   site         id     n
   <chr>     <int> <int>
 1 site 1      152     2
 2 site 1      160    13
 3 site 1000   164     4
 4 site 1000   166     1
 5 site 1001   164     2
 6 site 1001   166     1
 7 site 1002   164     4
 8 site 1002   166     1
 9 site 1003   164     3
10 site 1003   166     3
# ... with 3,031 more rows

Now I'm stuck. I would like to assign "groups" to the individuals using the same site. For example you can see that id 152 and 160 are both using site 1, and 164 and 166 use the sites 1000, 1001, 1002, and so on. In this case, "group1" would be assigned to id 152 and 160, and "group2" to 164 and 166.

Is there a way to do that in R? There are 37 individuals and still > 3,000 rows of data, so it's a lot to go through by hand. Some sites are used by 3 or 4 individuals, and I'm not sure if there are always the same combinations of id connected to a site, so I can't define the groups beforehand.

Here is a snippet of the grouped data frame:

df <- structure(list(site = c("site 1", "site 1", "site 1000", "site 1000", 
"site 1001", "site 1001", "site 1002", "site 1002", "site 1003", 
"site 1003", "site 1007", "site 1007", "site 1008", "site 1008", 
"site 1009", "site 1009", "site 1015", "site 1015", "site 1019", 
"site 1019", "site 1020", "site 1020", "site 1022", "site 1022", 
"site 1024", "site 1024", "site 1034", "site 1034", "site 1035", 
"site 1035", "site 1036", "site 1036", "site 107", "site 107", 
"site 108", "site 108", "site 111", "site 111", "site 131", "site 131", 
"site 132", "site 132", "site 133", "site 133", "site 134", "site 134", 
"site 135", "site 135", "site 136", "site 136"), id = c(152L, 
160L, 164L, 166L, 164L, 166L, 164L, 166L, 164L, 166L, 164L, 166L, 
164L, 166L, 164L, 166L, 164L, 166L, 164L, 166L, 164L, 166L, 164L, 
166L, 164L, 166L, 164L, 166L, 164L, 166L, 164L, 166L, 155L, 161L, 
155L, 161L, 155L, 161L, 155L, 161L, 155L, 161L, 155L, 161L, 155L, 
161L, 155L, 161L, 155L, 161L), n = c(2L, 13L, 4L, 1L, 2L, 1L, 
4L, 1L, 3L, 3L, 5L, 8L, 4L, 6L, 5L, 17L, 1L, 1L, 3L, 1L, 3L, 
2L, 3L, 1L, 3L, 1L, 3L, 1L, 1L, 5L, 1L, 4L, 5L, 3L, 5L, 3L, 2L, 
1L, 5L, 3L, 5L, 3L, 5L, 3L, 5L, 3L, 5L, 3L, 4L, 2L)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -50L), groups = structure(list(
    site = c("site 1", "site 1000", "site 1001", "site 1002", 
    "site 1003", "site 1007", "site 1008", "site 1009", "site 1015", 
    "site 1019", "site 1020", "site 1022", "site 1024", "site 1034", 
    "site 1035", "site 1036", "site 107", "site 108", "site 111", 
    "site 131", "site 132", "site 133", "site 134", "site 135", 
    "site 136"), .rows = structure(list(1:2, 3:4, 5:6, 7:8, 9:10, 
        11:12, 13:14, 15:16, 17:18, 19:20, 21:22, 23:24, 25:26, 
        27:28, 29:30, 31:32, 33:34, 35:36, 37:38, 39:40, 41:42, 
        43:44, 45:46, 47:48, 49:50), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -25L), .drop = TRUE))

Thank you!


Solution

  • Okay, I found a workaround. It might not be the most elegant, but in case someone has the same question:

    I used dplyr again and collapsed all the ids per site into a single string, then grouped by that new column:

    df <- filtered2 %>% 
      group_by(site) %>% 
      mutate(groups = paste0(id, collapse = " "))
    df2 <- df %>% group_by(groups) %>% summarise(n = n())
    

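    This gave me all combinations of ids, as I needed, along with a count for each combination. Note that n here counts rows (one per id per site), not sites, so a pair of ids sharing a single site shows up with n = 2.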
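A possible variation, sketched under some assumptions: collapsing to one row per site before counting gives the number of sites per combination directly, and sorting the ids first keeps a combination from being split by row order. The data below are the first ten rows of the question's example; `site_ids`, `combos`, `labelled`, `n_sites` and the "group1", "group2" labels are names invented for this illustration.

```r
library(dplyr)

# First ten rows of the example data from the question
filtered2 <- data.frame(
  site = rep(c("site 1", "site 1000", "site 1001", "site 1002", "site 1003"),
             each = 2),
  id   = c(152L, 160L, rep(c(164L, 166L), 4)),
  n    = c(2L, 13L, 4L, 1L, 2L, 1L, 4L, 1L, 3L, 3L)
)

# One row per site, with its ids sorted and collapsed into a single key
site_ids <- filtered2 %>%
  group_by(site) %>%
  summarise(ids = paste(sort(unique(id)), collapse = " "), .groups = "drop")

# One row per id combination: how many sites it covers, plus a group label
combos <- site_ids %>%
  count(ids, name = "n_sites") %>%
  mutate(group = paste0("group", row_number()))

# Attach the group label back onto every site/id row
labelled <- filtered2 %>%
  left_join(site_ids, by = "site") %>%
  left_join(combos, by = "ids")

print(combos)
```

With this subset, "152 160" covers one site and "164 166" covers four. Each distinct combination gets its own label, so ids that appear in several different combinations will carry different group labels on different sites.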