Search code examples
rgroupingranking

R: Identify highest ranking observation in groups of unequal size


The following code generates 2-level groups (by State within Test), then ranks each observation within each group based on ascending order of Grade. School is the tie-breaker.

School<-rep(c("A","B","C","D"),each=10)
State<-rep(c("NY","NJ"),times=20)
Test<-rep(c("LSAT", "MCAT", "GRE","TOEFL","ACT"), times=8)
Grade<-trunc(rep((seq(from=500, to=600,length.out=4))))
dat<-data.frame(Test,State,School,Grade)
library(plyr)
dat<-ddply(dat, .(Test, State),transform,num=rank(Grade,ties.method="first"))

I convert the first-ranked item within each group to "lowest" using the following code:

dat$num[dat$num==1]<-"lowest"

In this sample df, the number of items per group is always 4, so I can convert the highest-ranked item in each group to "highest" using the following code:

dat$num[dat$num==4]<-"highest"

But how can I tag observations with "highest" when the number of rows is not constant across all groups? The following code creates a version of the df with two additional rows in one of the groups.

School<-rep(c("A","B","C","D"),each=10)
State<-rep(c("NY","NJ"),times=20)
Test<-rep(c("LSAT", "MCAT", "GRE","TOEFL","ACT"), times=8)
Grade<-trunc(rep((seq(from=500, to=600,length.out=4))))
dat1<-data.frame(Test,State,School,Grade) 
dat1<-rbind(dat1,
     data.frame(Test="ACT",State="NJ",School="E",Grade=550),
     data.frame(Test="ACT",State="NJ",School="F",Grade=650))
library(plyr)
dat1<-ddply(dat1, .(Test, State),transform,num=rank(Grade,ties.method="first"))

Solution

  • You can do that by checking which is the highest/lowest in each group and assign highest/lowest to those rows. Here, I use ddply to do that since you already use plyr in your code:

    dat1 <- ddply(dat1, .(Test, State), transform, num=ifelse(num == max(num), "highest", 
                                                              ifelse(num == min(num), "lowest", num)))
    
    > dat1
        Test State School Grade     num
    1    ACT    NJ      A   533  lowest
    2    ACT    NJ      B   600       4
    3    ACT    NJ      C   533       2
    4    ACT    NJ      D   600       5
    5    ACT    NJ      E   550       3
    6    ACT    NJ      F   650 highest
    7    ACT    NY      A   500  lowest
    8    ACT    NY      B   566       3
    9    ACT    NY      C   500       2
    10   ACT    NY      D   566 highest
    11   GRE    NJ      A   600       3
    12   GRE    NJ      B   533  lowest
    13   GRE    NJ      C   600 highest
    14   GRE    NJ      D   533       2
    15   GRE    NY      A   566       3
    16   GRE    NY      B   500  lowest
    17   GRE    NY      C   566 highest
    18   GRE    NY      D   500       2
    19  LSAT    NJ      A   533  lowest
    20  LSAT    NJ      B   600       3
    21  LSAT    NJ      C   533       2
    22  LSAT    NJ      D   600 highest
    23  LSAT    NY      A   500  lowest
    24  LSAT    NY      B   566       3
    25  LSAT    NY      C   500       2
    26  LSAT    NY      D   566 highest
    27  MCAT    NJ      A   533  lowest
    28  MCAT    NJ      B   600       3
    29  MCAT    NJ      C   533       2
    30  MCAT    NJ      D   600 highest
    31  MCAT    NY      A   566       3
    32  MCAT    NY      B   500  lowest
    33  MCAT    NY      C   566 highest
    34  MCAT    NY      D   500       2
    35 TOEFL    NJ      A   600       3
    36 TOEFL    NJ      B   533  lowest
    37 TOEFL    NJ      C   600 highest
    38 TOEFL    NJ      D   533       2
    39 TOEFL    NY      A   500  lowest
    40 TOEFL    NY      B   566       3
    41 TOEFL    NY      C   500       2
    42 TOEFL    NY      D   566 highest
    

    If your data is large enough, you could also consider using dplyr or data.table which will be faster than plyr.