Search code examples
rfor-loopapplyoutliers

Using apply() and its correlaries to replace loops to apply outlier test (e.g.)


I have data from a behavioral task that looks something like this (assuming data frame named data):

data <- data.frame(subject = c(rep(8666, 6), rep(5452, 6)), RT = c(714, 877, 665, 854, 1092, 1960, 770, 4551, 1483, 1061, 755, 1090))
data
subject  RT
8666      714
8666      877
8666      665
8666      854
8666     1092
8666     1960
5452      770
5452     4551
5452     1483
5452     1061
5452      755
5452     1090

That is, for this question, I'm working with a selection of subjects and reaction times. (All told, 183 subjects with 156 trials each.) Using reshape's cast() function, I've calculated a value for each subject that I'd like to use to exclude certain trials.

outl <- function(x) {
    2.5 * mad(x) + median(x)
    }
melteddata <- melt(data, id.vars="subject", measure.vars = "RT")
outliers <- cast(melteddata, subject ~ ., outl)
colnames(outliers)[2] <- "outlier"

This outputs something like:

  subject    outlier
1    5452   2235.635
2    8666   1517.844
...

Now, the way I'd normally do this is to write a loop which, for each unique subject number, compares their RT to the outlier value for that subject:

data$outliers <- 0
for(subject in unique(data$subject)) {
    temp <- data[data$subject == subject,]
    temp$outliers <- ifelse(temp$RT > outliers[outliers$subject == subject,]$outlier, 0, 1)
    data[data$subject == subject,]$outliers <- temp$outliers
    }

... which marks the RTs of 1960 for subject 8666 and 4551 for 5452 as outliers.

However, I feel like there's got to be a more R way to do this. It feels like apply() should be able to do the same thing, and certainly this takes a long time to run as a loop. Any suggestions?

Edit: I realize I can do this with ddply() from the library(plyr) package instead of using melt() and cast():

library(plyr)
outliers <- ddply(data, .(subject), summarize, median = median(RT), mad = mad(RT), outlier = median(RT) + 2.5 * mad(RT))

Solution

  • Here's a try. Turn the outliers data frame into a named vector:

    out <- outliers$outlier
    names(out) <- outliers$subject
    

    Then use it as a lookup table to select all the rows of data where the RT column is less than the outlier value for the subject:

    data[data$RT < out[as.character(data$subject)], ]
    

    The as.character is necessary since the subject IDs are integers, and you don't want to get, e.g., the 8666th element of out.

    Edit to add a dplyr solution:

    group_by(data, subject) %>% summarize(outlier = 2.5 * mad(RT) + median(RT)) -> outliers
    merge(data, outliers)
    filter(data, RT < outlier)