How to calculate the standard deviation by ZIP code for noise complaint data in R


Purpose

For each ZIP code, I need to calculate the fraction of noise complaints whose descriptor mentions construction, then report the standard deviation by ZIP code.

How can I calculate that fraction of noise complaints (descriptors marked 'Noise:' in the data) and then calculate the standard deviation by ZIP code?

Problem

How do I use sd() to calculate a standard deviation per ZIP code when the value of interest is a fraction of the complaints column (descriptor)?

I am not sure how to get a standard deviation per ZIP code, because the data does not contain the fraction for each descriptor. My first attempt was to group by ZIP code and descriptor, then summarise with n(). I am not sure how to compute sd() for data in this format.

R Code

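library(dplyr)       # select(), filter(), group_by(), summarise()
library(stringr)     # str_detect()
library(data.table)  # as.data.table()
library(mltools)     # assumed source of one_hot(); the post does not say

# Keep only the columns needed for the analysis
nyc_comp_set <- nyc_comp %>%
  select(incident_zip, city, descriptor)

# Convert the text columns to factors before one-hot encoding
nyc_comp_set$city <- factor(nyc_comp_set$city)
nyc_comp_set$descriptor <- factor(nyc_comp_set$descriptor)

nyc_comp_en <- one_hot(as.data.table(nyc_comp_set))

# Keep only complaints whose descriptor mentions construction
# (refer to the column directly, not to nyc_comp_set$descriptor)
nyc_comp_const <- nyc_comp_set %>%
  filter(str_detect(descriptor, "Construct"))

# Count construction complaints per ZIP code and descriptor
nyc_comp_const_gp <- nyc_comp_const %>%
  group_by(incident_zip, descriptor) %>%
  summarise(nzip = n(), .groups = "drop")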

Perhaps the data should be organized like this:

  nyc_comp_set %>%
    group_by(incident_zip, descriptor) %>%
    summarise(n = n())

Data

Data is from 'nyc_noise_complaints.csv'. Here is a sample.

  incident_zip city     descriptor                                  
          <dbl> <fct>    <fct>                                       
 1        11231 BROOKLYN Noise: Construction Before/After Hours (NM1)
 2        10454 BRONX    Noise: Construction Before/After Hours (NM1)
 3        11234 BROOKLYN Noise: Construction Equipment (NC1)         
 4        11234 BROOKLYN Noise: Construction Equipment (NC1)         
 5        10462 BRONX    Noise: Construction Equipment (NC1)         
 6        10034 NEW YORK Noise: Construction Before/After Hours (NM1)
 7        10023 NEW YORK Noise: Construction Before/After Hours (NM1)
 8        11249 BROOKLYN Noise: Construction Before/After Hours (NM1)
 9        10001 NEW YORK Noise: Construction Before/After Hours (NM1)
10        10031 NEW YORK Noise: Construction Before/After Hours (NM1)

Solution

  • If you have a proportion p, then the standard deviation of the underlying 0/1 indicator (construction complaint or not) is sqrt(p * (1 - p)). Something like this:

     nyc_comp %>%
      group_by(incident_zip, city) %>%
      summarize(prop_construction = mean(grepl("Construction", descriptor)), .groups = "drop") %>%
      mutate(sd_construction = sqrt(prop_construction * (1 - prop_construction)))
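
To make the formula concrete, here is a minimal sketch on a small, made-up data frame. The `toy` data, the "Other" descriptor, and the names `per_zip` and `sd_across_zips` are hypothetical and are not taken from 'nyc_noise_complaints.csv'. It computes the per-ZIP proportion and sqrt(p * (1 - p)), and, in case the question instead means the spread of the fraction across ZIP codes, it also applies sd() to the per-ZIP proportions.

```r
library(dplyr)

# Hypothetical toy data, NOT rows from the real file
toy <- data.frame(
  incident_zip = c(11231, 11231, 11231, 11231, 10454, 10454, 10001, 10001),
  descriptor = c(
    "Noise: Construction Equipment (NC1)",
    "Noise: Construction Before/After Hours (NM1)",
    "Noise: Construction Equipment (NC1)",
    "Noise: Other (made-up category)",
    "Noise: Construction Equipment (NC1)",
    "Noise: Other (made-up category)",
    "Noise: Other (made-up category)",
    "Noise: Other (made-up category)"
  ),
  stringsAsFactors = FALSE
)

per_zip <- toy %>%
  group_by(incident_zip) %>%
  summarise(
    n_complaints = n(),
    # grepl() returns TRUE/FALSE; the mean of that is the fraction of TRUEs
    prop_construction = mean(grepl("Construction", descriptor, fixed = TRUE)),
    .groups = "drop"
  ) %>%
  # Standard deviation of the 0/1 indicator within each ZIP code
  mutate(sd_construction = sqrt(prop_construction * (1 - prop_construction)))

print(per_zip)

# Alternative reading: spread of the fraction across ZIP codes
sd_across_zips <- sd(per_zip$prop_construction)
print(sd_across_zips)
```

Note that sqrt(p * (1 - p)) divides by n, while calling sd() directly on the 0/1 indicator divides by n - 1, so the two differ slightly for ZIP codes with few complaints.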