Tags: r, ggplot2, statistics, visualization, mean

How to get mean/average line of multiple points (with different x-axis values) using R (or Excel)?


I have two dataframes (fire2022_12 and fire2021_12) with the same three columns (variable, x, y). I have pasted a snippet from the top of each dataframe below. I can plot both dataframes in the same plot just fine by doing this:

ggplot() +
  geom_line(data = fire2022_12, aes(x, y)) +
  geom_line(data = fire2021_12, aes(x, y))

I now want to create a line that represents the mean of these two dataframes, so that I can see how, on average, 'time since fire' (x) affects the habitat suitability (y, ranging from 0 to 1) of a species. However, I am having trouble with this. I think the issue is that the x-axis values of the two dataframes do not overlap perfectly: if an x-axis value is present in only one dataframe, the mean at that point simply takes that dataframe's y value, rather than the point being omitted (or aligned with the rest of the mean curve).

fire2022_12

timesincefire202212v2,-115.1,0.8304737597703934
timesincefire202212v2,-113.71879999999999,0.8304737597703934
timesincefire202212v2,-112.3376,0.8304737597703934
timesincefire202212v2,-110.9564,0.8304737597703934
timesincefire202212v2,-109.5752,0.8304737597703934
timesincefire202212v2,-108.19399999999999,0.8304737597703934
timesincefire202212v2,-106.8128,0.8304737597703934
timesincefire202212v2,-105.4316,0.8304737597703934
timesincefire202212v2,-104.0504,0.8304737597703934
timesincefire202212v2,-102.66919999999999,0.8304737597703934
timesincefire202212v2,-101.288,0.8304737597703934
timesincefire202212v2,-99.9068,0.8304737597703934
timesincefire202212v2,-98.5256,0.8304737597703934
timesincefire202212v2,-97.14439999999999,0.8304737597703934
timesincefire202212v2,-95.7632,0.8304737597703934
timesincefire202212v2,-94.382,0.8304737597703934
timesincefire202212v2,-93.0008,0.8304737597703934
timesincefire202212v2,-91.61959999999999,0.8304737597703934
timesincefire202212v2,-90.2384,0.8304737597703934

fire2021_12

timesincefire2021_12,-113.9,0.9661756336688996
timesincefire2021_12,-112.53320000000001,0.9661756336688996
timesincefire2021_12,-111.16640000000001,0.9661756336688996
timesincefire2021_12,-109.7996,0.9661756336688996
timesincefire2021_12,-108.4328,0.9661756336688996
timesincefire2021_12,-107.066,0.9661756336688996
timesincefire2021_12,-105.6992,0.9661756336688996
timesincefire2021_12,-104.3324,0.9661756336688996
timesincefire2021_12,-102.96560000000001,0.9661756336688996
timesincefire2021_12,-101.59880000000001,0.9661756336688996
timesincefire2021_12,-100.232,0.9661756336688996
timesincefire2021_12,-98.8652,0.9661756336688996
timesincefire2021_12,-97.4984,0.9661756336688996
timesincefire2021_12,-96.1316,0.9661756336688996
timesincefire2021_12,-94.76480000000001,0.9661756336688996
timesincefire2021_12,-93.398,0.9661756336688996
timesincefire2021_12,-92.0312,0.9661756336688996
timesincefire2021_12,-90.6644,0.9661756336688996

The closest I have come to a desired result is this:

library(tidyverse)  # attaches ggplot2 and dplyr, so they need not be loaded separately

#merging two dataframes into one
data2 <- rbind(fire2022_12, fire2021_12)

#rounding the x-axis values to a whole number
data2$x <- round(data2$x, 0)

#obtaining the mean y-axis value for each x-axis value
grouped <- data2 %>% group_by(x) %>% summarise(y = mean(y))

#plotting the data 
ggplot(data=grouped, aes(x=x, y=y))+geom_point()

When I plot the data, it looks like there are 3 lines instead of 1. I think this happens because, for some x-axis values, there is a y-axis value from only one of the dataframes rather than both. For example, the x-axis value of -115 is present only in fire2022_12 and not in fire2021_12, so the mean at that point is just fire2022_12's value. It would be nice if I could omit these unmatched points somehow, or force them to align with the x-axis values that do have y-axis values in both dataframes.
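The diagnosis above can be checked with a minimal base R sketch. It assumes stand-in data rebuilt from the pasted snippet (evenly spaced x, constant y), so the full dataframes will behave differently in detail. The helper columns `y` and `n_sources` in `grouped` are names chosen here for illustration. The sketch counts how many of the two dataframes contribute to each rounded x value.

```r
# Stand-in data rebuilt from the pasted snippet (assumption: evenly spaced x
# and constant y, as in the rows shown; the real dataframes will differ).
fire2022_12 <- data.frame(
  variable = "timesincefire202212v2",
  x = seq(-115.1, by = 1.3812, length.out = 19),
  y = 0.8304737597703934
)
fire2021_12 <- data.frame(
  variable = "timesincefire2021_12",
  x = seq(-113.9, by = 1.3668, length.out = 18),
  y = 0.9661756336688996
)

# Same steps as in the question: stack the dataframes and round x
data2 <- rbind(fire2022_12, fire2021_12)
data2$x <- round(data2$x, 0)

# Per rounded x: the mean y, and how many distinct source dataframes fed it
mean_y    <- tapply(data2$y, data2$x, mean)
n_sources <- tapply(data2$variable, data2$x, function(v) length(unique(v)))

grouped <- data.frame(
  x = as.numeric(names(mean_y)),
  y = as.vector(mean_y),
  n_sources = as.vector(n_sources)
)

# Rows with n_sources == 1 sit on one dataframe's own curve; rows with
# n_sources == 2 sit between the two curves.
print(table(grouped$n_sources))
print(sort(unique(round(grouped$y, 6))))
```

With this stand-in data, the rounded x values fall into three groups (only fire2022_12, only fire2021_12, or both), which would account for the three apparent lines. Keeping only the rows where `n_sources == 2` is one way to drop the unmatched points.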

I have included some screenshots below.

Screenshot 1: My result

Screenshot 2: What the original plotted dataframes looked like before I tried to create a 'mean/average' line

Screenshot 3: The result I get when I set geom to 'line' instead of 'point'

Thank you in advance! :)

Note: I have also tried using ggplot2's stat_summary(), but I run into a similar issue.


Solution

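  • Rather than rounding, try using a smooth line. A smoother is fitted to all the points pooled together, so the x-values of the two dataframes do not need to match:

    # df1 and df2 stand for the two dataframes (e.g. fire2022_12 and fire2021_12)
    rbind(df1, df2) |>
      ggplot(aes(x, y)) +
      geom_smooth()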
    

    Using Andre's nicely shared data, that ends up with this:

    [Image: the resulting smoothed line]
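If a pointwise mean is preferred over a smoother, one alternative (not the approach in the answer above) is to interpolate both curves onto a shared x grid and average them there. The following is a minimal base R sketch. It assumes stand-in data rebuilt from the snippet in the question, linear interpolation via `approx()`, and a grid of 200 points restricted to the x-range covered by both dataframes. The names `grid` and `mean_curve` are introduced here for illustration.

```r
# Stand-in data rebuilt from the pasted snippet (assumption: evenly spaced x
# and constant y, as in the rows shown; the real dataframes will differ).
fire2022_12 <- data.frame(
  variable = "timesincefire202212v2",
  x = seq(-115.1, by = 1.3812, length.out = 19),
  y = 0.8304737597703934
)
fire2021_12 <- data.frame(
  variable = "timesincefire2021_12",
  x = seq(-113.9, by = 1.3668, length.out = 18),
  y = 0.9661756336688996
)

# Shared grid limited to the x-range that both dataframes cover, so that
# no grid point relies on only one curve
lo <- max(min(fire2022_12$x), min(fire2021_12$x))
hi <- min(max(fire2022_12$x), max(fire2021_12$x))
grid <- seq(lo, hi, length.out = 200)  # 200 points is an arbitrary choice

# Linearly interpolate each curve onto the shared grid
y22 <- approx(fire2022_12$x, fire2022_12$y, xout = grid)$y
y21 <- approx(fire2021_12$x, fire2021_12$y, xout = grid)$y

# Pointwise mean of the two interpolated curves
mean_curve <- data.frame(x = grid, y = (y22 + y21) / 2)

print(head(mean_curve))
```

To draw the result on top of the two original lines, `geom_line(data = mean_curve, aes(x, y))` could be added to the existing `ggplot()` call. Restricting the grid to the overlapping range is a design choice: it omits the stretches where only one dataframe has data instead of letting them pull the mean toward a single curve.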