Search code examples
rfilterdplyrmeanstandard-deviation

Multiple filter arguments dplyr


I want to filter out multiple data errors in a huge (>20 000 points) dataset.

Here is a pretend dataset (EDIT: I simplified it significantly):

data<-data.table(age=c(1,1,1,2,2,2,3,3,4,4,4,4,4,4),wt=c(32,12,5,32,80,32,1,0,4,8,1,1,2,50))

In this hypothetical example, I want to exclude wt values >20 or <6 when age==1, then exclude any wt values +/- 1 SD from the mean for age 2-3 days, and then exclude any wt values +/- 2 SD from the mean for age 4.

EDIT Note that I am not trying to group ages 2-3 to extract 1 mean and 1 SD. Instead, I would like dplyr to extract the mean and SD at each age (2 and 3) individually and apply the same exclusion criteria over that range of ages.

I am generally familiar with dplyr and thought about tackling it like this (solution adapted from @Suran's answer that didn't work exactly as needed):

data_clean<-data%>%filter(
    !(age==1 & wt<6),
    !(age==1 & wt>20),
    !(age==2 & wt >= (mean((data%>%filter(age==2))$wt) +sd((data%>%filter(age==2))$wt))), 
    !(age==2 & wt <= (mean((data%>%filter(age==2))$wt)-sd((data%>%filter(age==2))$wt))),
    !(age==3 & wt >= (mean((data%>%filter(age==3))$wt) +sd((data%>%filter(age==3))$wt))), 
    !(age==3 & wt <= (mean((data%>%filter(age==3))$wt)-sd((data%>%filter(age==3))$wt))),
    !(age==4 & wt >= (mean((data%>%filter(age==4))$wt) +2*sd((data%>%filter(age==4))$wt))), 
    !(age==4 & wt <= (mean((data%>%filter(age==4))$wt)-2*sd((data%>%filter(age==4))$wt)))
        )

This is a really cumbersome solution and is not going to be feasible for me given I actually have 8 different exclusion criteria across multiple ages. Any suggestions on how I can bring this together?

EDIT: The desired final dataset would look like this:

     age  wt
      1  12
      2  32
      2  32
      3   1
      3   0
      4   4
      4   8
      4   1
      4   1
      4   2

Solution

  • To get the mean() for each age you need to first group_by(age) and scale() before doing the filter arguments.

    data_clean <- data
        group_by(age) %>%
        mutate(x = abs(scale(wt)[,1])) %>% #create a new variable that scales the wt, x<=1 means wt is within 1 SD of mean, x<=2 means wt is within 2 SD of mean
        ungroup() %>%
        filter((age==1 & wt %in% c(6:20) | #keep weights >6g and <20g at age==1
            age %in% c(2:3) & x <= 1 | #keep mean wts ± 1 SD for 2-3 days 
            age>=4 & x<=2) %>% #keep mean wts ± 2 SD for >=4 days
         select(-x)