Search code examples
sqlrsassumunique

How can I sum the unique values in a variable by the observations in another variable?


I have a dataset containing information on hospital visits for certain years. One of the variables contains county codes for counties that the visits occurred in and another variable contains the year for each visit that occurred. What I'm trying to do is sum how many unique counties hospital visits occurred in per year. I have access to SAS code which does this and I'm trying to duplicate it somehow in R.

The SAS code is:

proc sql;
    select ED_VISIT_YEAR, count(distinct COUNTY_CD) as Counties from dataset
    group by ED_VISIT_YEAR;
quit; run;

I have tried using the aggregate and unique functions along with sum, but I always get errors.

The only thing I have got to work is this:

Check <- as.matrix(unique(ED_a$COUNTY_CD,ED_a$ED_VISIT_YEAR))

Which produces outcome like this:

  V1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 1185

what I'm looking for is something like this:

ED_VISIT_YEAR  Counties
2005              16
2006              16
2007              16
2008              16

I'm still fairly new to R and this has me stumped so any help would be greatly appreciated.


Solution

  • Using dplyr:

    ED_a %>% group_by(ED_VISIT_YEAR) %>% tally(name="counties")
    

    If you don't want to use dplyr, a base solution could be this:

    countysummaries<-as.data.frame(counties=sapply(split(ED_a,ED_a$ED_VISIT_YEAR),nrow)
    

    Note that this will give you the years as rownames rather than a column, so you might want to do this afterwards:

    countysummaries$ED_VISIT_YEAR<-rownames(countysummaries)
    

    (the tidyverse also has rownames_to_column, but here we're looking for a base only solution).