Search code examples
rdataframemultiple-columnscalculated-columns

summing a column based on values in two other columns


I have a data frame that lists individual mass shootings for each state between 1991-2020. I would like to 1) sum the total victims each year for each state, and 2) sum the total number of mass shootings each state had each year.

So far, I've only managed to get a total sum of victims between 1991-2020 for each state. And I'm not even sure how I could get a column with the total incidents per year, per state. Are there any adjustments I can make to the aggregate function, or is there some other function to get the information I want?

What I have:

combined = read.csv('https://raw.githubusercontent.com/bandcar/massShootings/main/combo1991_2020_states.csv')

> head(combined)
       state       date year fatalities injured total_victims
3342 Alabama 04/07/2009 2009          4       0             4
3351 Alabama 03/10/2009 2009         10       6            16
3285 Alabama 01/29/2012 2012          5       0             5
135  Alabama 12/28/2013 2013          3       5             8
267  Alabama 07/06/2013 2013          0       4             4
557  Alabama 06/08/2014 2014          1       4             5

q = aggregate(total_victims ~ state,data=combined,FUN=sum)

> head(q)
       state total_victims
1    Alabama           364
2     Alaska            19
3    Arizona           223
4   Arkansas           205
5 California          1816
6   Colorado           315

What I want for each state for each year:

  year      state total_victims total_shootings
1 2009    Alabama            20               2           
2 2012    Alabama             5               1
3 2013    Alabama            12               2
4 2014    Alabama             5               1

Solution

  • You can use group_by in combination with summarise() from the tidyverse packages.

    library(tidyverse)
    
    combined |> 
      group_by(state, year) |> 
      summarise(total_victims = sum(total_victims),
                total_shootings = n())
    

    This is the result you get:

    # A tibble: 457 x 4
    # Groups:   state [52]
       state    year total_victims total_shootings
       <chr>   <int>         <int>           <int>
     1 Alabama  2009            20               2
     2 Alabama  2012             5               1
     3 Alabama  2013            12               2
     4 Alabama  2014            10               2
     5 Alabama  2015            17               4