I am trying to write a code that extracts the ID of the teacher who assigned the same mid_term score to their students but then gave different scores on their finals.
While I'm interested in identifying the case where mid_term scores were the same for different final_scores, I am not interested in the converse relation i.e., it doesn't matter to me if the same final_scores are given for different mid_term scores.
My input df
looks something like this: :
TUserId SUID mid_sum final_sum
115 201 7 1
115 309 7 2
209 245 10 2
209 398 10 2
209 510 10 3
302 423 8 1
302 456 8 1
412 189 6 1
412 191 7 1
In the output I am just looking for a way to have the following flags with the teacher IDs
TUserId Flag
115 inconsistent
209 inconsistent
302 consistent
412 consistent
OR
TUserId Flag
115 TRUE
209 TRUE
302 FALSE
412 FALSE
When the value of mid_sum is the same across different students (shown by SUID
) of the same teacher (shown by TUserId
), it puts inconsistent in an additional column. I'm mainly looking for extracting the teacher ID where such discrepancy appeared.
This is related to the following two posts but one of the solutions meet my requirements.
Rolling computation to identify mismatch between two columns and Finding if a value is within the range of other columns
Any help on this would be greatly appreciated.
Third attempt :-)
do.call(rbind, by(dat, dat$TUserId, FUN = function(z) {
data.frame(
TUserId = z$TUserId[1],
Flag = any(table(unique(subset(z, select=c("mid_sum","final_sum")))$mid_sum) > 1)
)
}))
# TUserId Flag
# 115 115 TRUE
# 209 209 TRUE
# 302 302 FALSE
# 412 412 FALSE
Walk-through, inside-out:
unique(subset(..))
reduces the number of unique combinations of mid_sum
and final_sum
; the premise that either they should both be the same or both be different is the key here;table(unique(..)$mid_sum)
counts the number of different values; since we've reduced all of the mid/final pairs, if we see a mid_sum
value that occurs more than once, it means that we have something wrong here;any(..)
will return a single TRUE
for this one teacher (since we're grouping ... see by
below) if any of the rows show inconsistency, we don't care which one;data.frame(..)
is just a convenience for combining afterwards, there are certainly different ways to deal with the expected output;by(dat, dat$TUserId, ..)
groups the data by each teacher, and passes to the anonymous function as its z
parameters; the first time this function is called, all it sees as z
is dat[1:2,]
(id 115); the second time this anon-func is called, all it sees as z
is dat[3:5,]
(id 209); etc;do.call(rbind, ..)
: the return value from by
is a list
, and because of the data.frame(.)
above, it is a list of frames; the best and fastest way to combine this list of frames into a single frame is to do this trick. There are other similar and/or better methods out of base R (e.g., dplyr::bind_rows
and data.table::rbindlist
).