Search code examples
rsassummary

Summarising status from multiple rows in R


I'd like to know how to summarise the status of a particular ID across multiple rows. I know how to do this in SAS with a BY statement, first. and last. but unsure how to achieve the same thing in R.

Example of data that needs to be summarise, i.e. for each ID, is the colour column: "red only", "blue only", or "both"

Generate data frame

example <- data.frame(id = c("A1", "A1", "A1", "A2", "A3", "A3", "A4", "A4", "A4", "A5", "A5", "A6"),
                      colour = c("red", "red", "blue", "red", "blue", "blue", "red", "red", "red", "red", "blue", "red"))

Output table

   id colour
1  A1    red
2  A1    red
3  A1   blue
4  A2    red
5  A3   blue
6  A3   blue
7  A4    red
8  A4    red
9  A4    red
10 A5    red
11 A5   blue
12 A6    red

Desired result

  id    status
1 A1      both
2 A2  red only
3 A3 blue only
4 A4  red only
5 A5      both
6 A6  red only

Equivalent Code in SAS would be:

data table1 (keep=id status);
set example;
by id;
retain red_count blue_count;

if first.id then do;
  red_count = 0;
  blue_count = 0;
end;

if colour = "red" then red_count+1;
if colour = "blue" then blue_count+1;

if last.id then do;
  if red_count > 0 and blue_count > 0 then status = "both";
  else if red_count > 0  then status = "red only";
  else if blue_count > 0 then status = "blue only";
  output;
end;

run;

Solution

  • You can use dplyr to make this pretty straight forward

    library(dplyr)
    example %>% 
      group_by(id) %>% 
      summarize(status = case_when(
        all(colour=="red") ~ "red only",
        all(colour=="blue") ~ "blue only",
        TRUE ~ "both"
      ))
    #   id    status   
    #   <chr> <chr>    
    # 1 A1    both     
    # 2 A2    red only 
    # 3 A3    blue only
    # 4 A4    red only 
    # 5 A5    both     
    # 6 A6    red only