Search code examples
rlmrate

How to create rate on R


I want to change my data so that it gives me the rate of pedestrians to that states population. I am using a linear model and my summary values look like this: Coefficients:

             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.087061   0.029876   2.914  0.00438 **
intersection 0.009192   0.003086   2.978  0.00362 ** 

Here, my beta value intersection is .009192 and that is not meaningful because compared to a state that has a smaller population, this value might be nothing in comparison.

Below is a condensed version of my data without all the columns I use, but here is the link of the csv incase someone wants to download it from there.

> head(c)
# A tibble: 6 x 15
  STATE STATENAME  PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME     CITY   DAY MONTH  YEAR LATITUDE LONGITUD
  <dbl> <chr>     <dbl>   <dbl>   <dbl>      <dbl>  <dbl> <chr>         <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1     1 Alabama       0       3       3          0     81 LEE (81)       2340     7     2  2019     32.7    -85.3
2     1 Alabama       0       2       2          0     55 ETOWAH (55)    1280    23     1  2019     34.0    -86.1
3     1 Alabama       0       4       4          0     29 CLEBURNE (29)     0    22     1  2019     33.7    -85.4
4     1 Alabama       1       1       1          1     55 ETOWAH (55)    2562    22     1  2019     34.0    -86.1
5     1 Alabama       0       1       1          0      3 BALDWIN (3)       0    18     1  2019     30.7    -87.8
6     1 Alabama       0       2       2          0     85 LOWNDES (85)      0     7     1  2019     32.2    -86.4
# … with 1 more variable: FATALS <dbl>

Here is the code I have that runs through the process I am doing. I don't see how I can change it so that each value is a rate (values like peds or type_int)

#Libraries
rm(list=ls()) # this is to clear anything  in memory
library(leaflet)
library(tidyverse)
library(ggmap)
library(leaflet.extras)
library(htmltools)
library(ggplot2)
library(maps)
library(mapproj)
library(mapdata)
library(zoo)
library(tsibble)

setwd("~/Desktop/Statistics790/DataSets/FARS2019NationalCSV")
df <- read.csv("accident.csv")

state <- unique(df$STATE)
for(i in state){
  df1<- df %>%
    filter(STATE==i) %>%
    dplyr::select(c(STATE,PEDS,DAY,MONTH,YEAR,TYP_INT)) %>%
    mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"),"%Y-%m-%d"))) %>% # create a date
    group_by(date) %>% # Group by State id and date
    # summarise_at(.vars = vars(PEDS), sum)
    summarise(pedday=sum(PEDS),intersection=mean(TYP_INT))
#ts1<-ts(df,start=c(2019,1,1), frequency=365) 
setwd("~/Desktop/Statistics790/States_ts/figures")
plots<-df1 %>% 
    ggplot()+
    geom_line(aes(x=date,y=pedday))+ylim(0,13)+
    theme_bw()
    ggsave(paste0("state_",i,".png"),width=8,height=6, )
  ts1<-ts(df1,start=c(2019,1,1), frequency=365)
  setwd("~/Desktop/Statistics790/States_ts")
  ts1 %>% write.csv(paste0("state_",i,".csv"),row.names = F)
#Plots
}
#date1<- as.character(df$date)
#df1<- df%>% filter(STATE=="1")
#ts2<-xts(df,order.by = as.Date(df$date,"%Y-%m-%d"))
setwd("~/Desktop/Statistics790/States_ts")
cat("\f")
#df <- read.csv(paste0("state_1.csv"))
#print("------Linear Model------")
#summary(lm(pedday~weather,data=df))

for(i in state){
  print(paste0("-------------------------Analysis for State: ",i," -------------------------------"))
  df <- read.csv(paste0("state_",i,".csv"))
  print("------Linear Model------")
  print(summary(lm(pedday~intersection,data=df)))
}

Solution

  • Collating my answers from the comments: you need to get state population data from an outside source such as the US Census https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage_1574439295, read it in, join it to your dataset, and then calculate rate as pedestrians per population, scaled for ease of reading on the graph. You can make your code faster by taking some of your calculations out of the loop. The code below assumes the census data is called 'census.csv' and has columns 'Geographic Area' for state and 'X2019' for the most recent population data available.

    pop <- read.csv('census.csv')
    df <- read.csv('accidents.csv') %>% 
       left_join(pop, by = c('STATENAME' = 'Geographic Area') %>% 
       mutate(rate = (PEDS / X2019) * <scale>) %>%
       mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"),"%Y-%m-%d")))
    

    The left_join will match state names and give each row a population value depending on its state, regardless of how many rows there are.