Search code examples
rdplyrcountsubsetsummary

How to look at a certain value/category of a variable according to another one (Large dataset)


I'm writing this question according to another problem I have in R studio. I have a very large dataset that consists of movement data (ACC) for birds and I have multiple rows per individuals (each row represents a timestamp). In my dataset, I need to look at how many individuals I have in a certain territory. The problem here is that I have many rows for each individual, and using simple function like table or summary returns the number of rows assigned to this territory. What I would like is to know individuals that belong to the territory, using a simple function.

This is what I've done so far:

  • I have many rows in my dataframe, but only around 50 individuals (that have multiple rows each).
  • I have around 15 territories in total, and each row has one territory ID (repeated).

I've tried to use table

table(df$territory_id) %>% sort(decreasing = TRUE) %>% head

which gives me the output:

ter1  ter2  ter3  ter4  ter5  ter6 
275034 207746 232739 165260 162103 259644

Here I have the number of rows that have the territory ID. Because I wanted to know how many different individuals belong in a territory, I've subsetted the territories in separate object and done tables for that:

t <- filter(df, territory == "ter1")

and then:

table(t$individualID)

which gave me the output I wanted. However, I need to repeat the process for each territory.

I would like to know if there is a simpler way to do that? I only have 15 territories but if I had more, that would take a lot of time to repeat the function. Is there a simpler way to do that?


Solution

  • Your data seems like it's quite large, so while your head gives an idea of what the data looks like, it's not great to work with (because it seems like it's six timestamps for one bird in one location). So I created my own, which is hopefully still similar:

    library(tidyverse)
    set.seed(0)
    
    df <- data.frame(
        bird_id = rep(1:10, each = 10),
        territory_id = sample(LETTERS[1:10], 100, replace = TRUE),
        timestamp =  ymd_hms("2023-01-01 12:00:00") + sample(1:10000000, 100, replace = TRUE))
    
    > head(df)
      bird_id territory_id           timestamp
    1       1            I 2023-03-05 03:57:14
    2       1            D 2023-01-01 21:06:37
    3       1            G 2023-03-01 07:23:02
    4       1            A 2023-02-23 01:09:48
    5       1            B 2023-03-29 22:41:45
    6       1            G 2023-01-29 03:29:01
    

    So while it's clear to me you want to analyse your dataset, I'm unsure what specifically you are trying to do. So here are a few things you could be wanting, and how you could do them:

    # 1. get the number of birds you have seen at any point in each territory
    df |>
      distinct(territory_id, bird_id) |>
      count(territory_id)
    
    # 2. count the number of rows in your dataset for each territory
    count(df, territory_id)
    
    # 3. count the number of rows in your dataset for each territory and bird
    
    count(df, territory_id, bird_id)