I have data from a behavioral task that looks something like this (assuming data frame named data):
data <- data.frame(subject = c(rep(8666, 6), rep(5452, 6)), RT = c(714, 877, 665, 854, 1092, 1960, 770, 4551, 1483, 1061, 755, 1090))
data
subject RT
8666 714
8666 877
8666 665
8666 854
8666 1092
8666 1960
5452 770
5452 4551
5452 1483
5452 1061
5452 755
5452 1090
That is, for this question, I'm working with a selection of subjects and reaction times. (All told, 183 subjects with 156 trials each.) Using reshape's cast() function, I've calculated a value for each subject that I'd like to use to exclude certain trials.
outl <- function(x) {
2.5 * mad(x) + median(x)
}
melteddata <- melt(data, id.vars="subject", measure.vars = "RT")
outliers <- cast(melteddata, subject ~ ., outl)
colnames(outliers)[2] <- "outlier"
This outputs something like:
subject outlier
1 5452 2235.635
2 8666 1517.844
...
Now, the way I'd normally do this is to write a loop which, for each unique subject number, compares their RT to the outlier value for that subject:
data$outliers <- 0
for(subject in unique(data$subject)) {
temp <- data[data$subject == subject,]
temp$outliers <- ifelse(temp$RT > outliers[outliers$subject == subject,]$outlier, 0, 1)
data[data$subject == subject,]$outliers <- temp$outliers
}
... which marks the RTs of 1960 for subject 8666 and 4551 for 5452 as outliers.
However, I feel like there's got to be a more R way to do this. It feels like apply() should be able to do the same thing, and certainly this takes a long time to run as a loop. Any suggestions?
Edit: I realize I can do this with ddply() from the library(plyr) package instead of using melt() and cast():
library(plyr)
outliers <- ddply(data, .(subject), summarize, median = median(RT), mad = mad(RT), outlier = median(RT) + 2.5 * mad(RT))
Here's a try. Turn the outliers data frame into a named vector:
out <- outliers$outlier
names(out) <- outliers$subject
Then use it as a lookup table to select all the rows of data where the RT column is less than the outlier value for the subject:
data[data$RT < out[as.character(data$subject)], ]
The as.character
is necessary since the subject IDs are integers, and you don't want to get, e.g., the 8666th element of out
.
Edit to add a dplyr
solution:
group_by(data, subject) %>% summarize(outlier = 2.5 * mad(RT) + median(RT)) -> outliers
merge(data, outliers)
filter(data, RT < outlier)