
How to plot the frequency of a categorical variable according to a quantitative variable


I need your help with a small problem. For a master's project, I need to plot the frequency of a behavior in a bird species according to age in days. Unfortunately, I can't provide all the data because it's confidential, but I can describe the example I'm trying to make:

  • I have 7 different types of behaviors: Active, Feeding, etc.
  • I have a pool of individuals that were tagged, and I have data for ages ranging from 0 to 1500 days (relative to the tagging date).

What I need to do is see whether a behavior is more frequent at a given age, and plot it.

I've tried several different methods, like this one I found on the site. First I tried to calculate the frequency of behaviors for every age:

 df %>%
 group_by(agesincetaggingdays, behaviors) %>%
 summarise(n = n()) %>%
 mutate(freq = n / sum(n))

This gave me:

    agesincetaggingdays behaviors     n     freq
                  <dbl> <chr>     <int>    <dbl>
  1                   0 Active        5 0.000410
  2                   0 Feeding      49 0.0724

Basically, the output gives me the frequency of each behavior at each age, across all individuals.

Now I want to know how I can extract these frequencies and plot them for each behavior. Taking my previous example again:

If I want to see how active the birds are according to their age, I would have to extract all the frequencies of the Active behavior across all ages, then plot the frequency on the y axis and the age on the x axis.

Is there a way to do that? Don't hesitate to ask if you need more details.

Thank you!


Solution

  • If I understand you correctly, there is no need to extract anything: the code you used already gives you a data frame that can be passed directly to ggplot() to make a plot.

    I created some dummy data to illustrate the workflow. Of course you might want to choose a plot type that suits your needs (I used a stacked area chart below).

    library(tidyverse)
    
    # create sample data
    df <- tibble(
      behavior = sample(letters[1:5], size = 1e6, replace = TRUE),
      age = sample(0:1500, size = 1e6, replace = TRUE)
    )
    
    # compute shares
    df <- df |> 
      count(age, behavior) |> 
      mutate(share = n / sum(n),
             .by = age)
    
    # plot
    ggplot(df) +
      geom_area(aes(age, share, fill = behavior))
    

    Created on 2023-10-21 with reprex v2.0.2
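
If a separate curve per behavior is preferred over a stacked chart, the same summary can be filtered or faceted. The following is only a sketch under assumptions: the data are randomly generated, the column names agesincetaggingdays and behaviors are taken from the question, "Resting" is a made-up third behavior, and the object names freq, active, p_active and p_all are purely illustrative.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # make the toy data reproducible

# Hypothetical toy data using the column names from the question;
# "Resting" is an invented behavior added only for illustration
df <- tibble(
  agesincetaggingdays = sample(0:1500, size = 1e5, replace = TRUE),
  behaviors = sample(c("Active", "Feeding", "Resting"),
                     size = 1e5, replace = TRUE)
)

# Share of each behavior within each age (shares sum to 1 per age)
freq <- df |>
  count(agesincetaggingdays, behaviors) |>
  mutate(freq = n / sum(n), .by = agesincetaggingdays)

# One behavior only: keep the "Active" rows and plot frequency against age
active <- freq |>
  filter(behaviors == "Active")

p_active <- ggplot(active, aes(agesincetaggingdays, freq)) +
  geom_line() +
  labs(x = "Age since tagging (days)", y = "Frequency of Active")

# All behaviors at once, one panel per behavior
p_all <- ggplot(freq, aes(agesincetaggingdays, freq)) +
  geom_line() +
  facet_wrap(vars(behaviors))

print(p_active)
print(p_all)
```

With real data, ages with few observations will produce noisy frequencies, so a smoother such as geom_smooth() in place of geom_line() may be worth trying.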