I have a data table (read from a CSV, shown below) containing voting data. I need to know how many votes come in per day, on average, for each year. My plan is to fit a linear regression of votesneeded ~ daysuntilelection within each year; the slope is then the average number of votes coming in per day.
How can I run a linear regression function over this dataframe by year?
date,year,daysuntilelection,votesneeded
2018-01-25,2018,9,40
2018-01-29,2018,5,13
2018-01-30,2018,4,-11
2018-02-03,2018,0,-28
2019-01-23,2019,17,81
2019-02-01,2019,8,-4
2019-02-09,2019,0,-44
2020-01-17,2020,22,119
2020-01-24,2020,15,58
2020-01-30,2020,9,12
2020-02-03,2020,5,-4
2020-02-07,2020,1,-12
2021-01-08,2021,29,120
2021-01-26,2021,11,35
2021-01-29,2021,8,17
2021-02-01,2021,5,-2
2021-02-03,2021,3,-8
2021-02-06,2021,0,-10
The preferred output would be a dataframe looking something like this:
year averagevotesperday
2018 8.27
2019 7.40
2020 6.55
2021 4.60
note: full data sets and analyses are at https://github.com/robhanssen/glenlake-elections, for the curious.
Do you need something like this?
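library(dplyr)

# `dat` is the data frame read from the CSV in the question
dat |>
  group_by(year) |>
  summarize(
    # the second coefficient is the slope on daysuntilelection
    avgvote_day = coef(lm(votesneeded ~ daysuntilelection))[2]
  )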
The output differs slightly from yours:
# A tibble: 4 x 2
   year avgvote_day
  <int>       <dbl>
1  2018        7.76
2  2019        7.40
3  2020        6.41
4  2021        4.74
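For anyone who wants to reproduce this without dplyr, here is a minimal base R sketch of the same idea. It assumes the rows posted in the question are the whole data set; the names `dat`, `slopes` and `result` are only illustrative, and the output column is named `averagevotesperday` to match the requested layout.

```r
# Rebuild the sample rows from the question as a data frame.
dat <- read.csv(text = "date,year,daysuntilelection,votesneeded
2018-01-25,2018,9,40
2018-01-29,2018,5,13
2018-01-30,2018,4,-11
2018-02-03,2018,0,-28
2019-01-23,2019,17,81
2019-02-01,2019,8,-4
2019-02-09,2019,0,-44
2020-01-17,2020,22,119
2020-01-24,2020,15,58
2020-01-30,2020,9,12
2020-02-03,2020,5,-4
2020-02-07,2020,1,-12
2021-01-08,2021,29,120
2021-01-26,2021,11,35
2021-01-29,2021,8,17
2021-02-01,2021,5,-2
2021-02-03,2021,3,-8
2021-02-06,2021,0,-10")

# Split by year, fit one regression per group, and keep only the slope
# (the second coefficient; the first is the intercept).
slopes <- sapply(split(dat, dat$year), function(d) {
  unname(coef(lm(votesneeded ~ daysuntilelection, data = d))[2])
})

# Assemble the result in the layout asked for in the question.
result <- data.frame(
  year = as.integer(names(slopes)),
  averagevotesperday = round(unname(slopes), 2)
)
print(result)
```

On the rows posted here this should give the same slopes as the dplyr version (7.76, 7.40, 6.41, 4.74), so any difference from the numbers in the question probably comes from the data used rather than from the method.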