I have data on all the US Hospitals regarding certain attributes. A representative sample may look like the one shown below:
Hospital.Name State HA HF PN Rank
1 HOSPITAL 1 NY 10.1 7.1 7.7 1
2 HOSPITAL 2 PA 10.4 11.2 11.6 2
3 HOSPITAL 3 SD 10.5 9.9 11.6 3
4 HOSPITAL 4 CA 10.5 8.4 9.7 4
5 HOSPITAL 5 CT 10.6 11.9 11.8 5
6 HOSPITAL 6 PA 10.7 8.3 10.9 6
The data shown above is already part processed - arranged and ranked by the attribute "HA" and a column ("Rank") has been added to reflect that.
However, this is not what I need. I am trying to rank this data, grouped by state, for these attributes.
So, the final output would be somewhat like below:
Hospital.Name State HA HF PN Rank
1 HOSPITAL 1 AK 12.1 7.1 7.7 1
2 HOSPITAL 2 AK 12.4 11.2 11.6 2
3 HOSPITAL 3 AK 14.5 9.9 11.6 3
4 HOSPITAL 4 AL 11.5 8.4 9.7 1
5 HOSPITAL 5 AL 15.6 11.9 11.8 2
6 HOSPITAL 6 AL 16.7 8.3 10.9 3
Please note that value of attribute "HA" for rank 1 of a particular state could be higher or lower than value of attribute HA for rank 1 of another state. Essentially, the ranks for the attribute "HA", within states, are independent.
So, far, I have tried using order and rank functions, nestled, within mutate function. However, both throw up the following error:
Error: incompatible size (%d), expecting %d (the group size) or 1
How can I achieve what I need? Any help / guidance / advice would be highly appreciated.
**Please note that I have data for all the 50 states running into thousands of rows and what is shown in table 1 above is only representative data.
Also, please note that the data in Table 2 above is a sample of the desired output and not the whole output. **
Here is another solution using dplyr
. You can group by the State and then do the ranking:
dt <- data.table(Hospital.Name = "Hospital",
State = rep(c("AK", "AL"), each = 2),
HA = c(10,12,11,14))
dt %>%
group_by(State) %>%
mutate(Rank = rank(HA))
output:
Hospital.Name State HA Rank
1 Hospital AK 10 1
2 Hospital AK 12 2
3 Hospital AL 11 1
4 Hospital AL 14 2
One more word on the confusion that came up:
rank(dt$HA)
gives a vector (here size 4) whereas rank(HA)
gives a scalar which represents the current rank of the current row within the column HA (also on the group basis if specified).