Search code examples
rcomparisonrowmatching

Rolling comparison (per row) between two columns


I am trying to write a code that extracts the ID of the teacher who assigned the same mid_term score to their students but then gave different scores on their finals.

While I'm interested in identifying the case where mid_term scores were the same for different final_scores, I am not interested in the converse relation i.e., it doesn't matter to me if the same final_scores are given for different mid_term scores.

Input:

My input df looks something like this: :

TUserId  SUID   mid_sum final_sum
 115      201   7       1
 115      309   7       2
 209      245   10      2
 209      398   10      2
 209      510   10      3
 302      423   8       1
 302      456   8       1
 412      189   6       1
 412      191   7       1
Output:

In the output I am just looking for a way to have the following flags with the teacher IDs

TUserId  Flag
 115     inconsistent
 209     inconsistent
 302     consistent
 412     consistent

OR

TUserId  Flag
 115     TRUE
 209     TRUE
 302     FALSE
 412     FALSE

Requirements:

When the value of mid_sum is the same across different students (shown by SUID) of the same teacher (shown by TUserId), it puts inconsistent in an additional column. I'm mainly looking for extracting the teacher ID where such discrepancy appeared.

This is related to the following two posts but one of the solutions meet my requirements.

Rolling computation to identify mismatch between two columns and Finding if a value is within the range of other columns

Any help on this would be greatly appreciated.


Solution

  • Third attempt :-)

    do.call(rbind, by(dat, dat$TUserId, FUN = function(z) {
      data.frame(
        TUserId = z$TUserId[1],
        Flag = any(table(unique(subset(z, select=c("mid_sum","final_sum")))$mid_sum) > 1)
      )
    }))
    #     TUserId  Flag
    # 115     115  TRUE
    # 209     209  TRUE
    # 302     302 FALSE
    # 412     412 FALSE
    

    Walk-through, inside-out:

    • unique(subset(..)) reduces the number of unique combinations of mid_sum and final_sum; the premise that either they should both be the same or both be different is the key here;
    • table(unique(..)$mid_sum) counts the number of different values; since we've reduced all of the mid/final pairs, if we see a mid_sum value that occurs more than once, it means that we have something wrong here;
    • any(..) will return a single TRUE for this one teacher (since we're grouping ... see by below) if any of the rows show inconsistency, we don't care which one;
    • data.frame(..) is just a convenience for combining afterwards, there are certainly different ways to deal with the expected output;
    • by(dat, dat$TUserId, ..) groups the data by each teacher, and passes to the anonymous function as its z parameters; the first time this function is called, all it sees as z is dat[1:2,] (id 115); the second time this anon-func is called, all it sees as z is dat[3:5,] (id 209); etc;
    • do.call(rbind, ..): the return value from by is a list, and because of the data.frame(.) above, it is a list of frames; the best and fastest way to combine this list of frames into a single frame is to do this trick. There are other similar and/or better methods out of base R (e.g., dplyr::bind_rows and data.table::rbindlist).