Search code examples
rdplyrdata-manipulationoutliers

How do I get the index of filtered observations?


CD <- filter(CD, AGE <= quantile(AGE,probs=.75))

CD <- filter(CD, AMOUNT <= quantile(AMOUNT,probs=.75))

I'm attempting to remove outliers. I want to remove outliers for these two variables (possibly more).

Instead of removing outliers from one variable and another afterward-- which could potentially remove observations which were never outliers due to filtering a first time -- how can I grab the index placements of these outliers?

This way, I can simply select all indexes that are not included in the list that was just coded.


Solution

  • in base R. This expression returns the index.

    # Return index positions
    which(CD$AMOUNT <= quantile(CD$AMOUNT, probs = .75))
    
    # Return subsetted table
    CD[which(CD$AMOUNT <= quantile(CD$AMOUNT, probs = .75)),]
    

    Here's an example to return only the outliers that are present in all columns based on a condition.

    set.seed(10)
    
    CD = data.frame(AMOUNT = runif(100)*100,
                    AGE = runif(100)*25,
                    RATE = runif(100)*20)
    
    # Return all the indexes that match the condition below
    ids = sort(Reduce(union, lapply(CD, function(x) {
     which(x <= quantile(x, probs = .50))
    })))
    
    # Return the outliers
    CD[-ids,]
    

    As you can see below we have the median value for each column.

    > lapply(CD, function(x) {quantile(x, probs = .50)})
    $AMOUNT
         50% 
    46.25863 
    
    $AGE
         50% 
    14.06169 
    
    $RATE
         50% 
    12.11707 
    
    # The table below satisfies the condition
    # where all values in each row are greater than the median of each column.
    
    > CD[-ids,]
         AMOUNT      AGE     RATE
    11 65.16557 19.60416 14.26684
    12 56.77378 21.06740 13.41244
    29 77.07715 21.42183 16.44666
    37 82.26526 17.82989 17.90719
    40 50.05032 13.75819 14.70880
    48 48.61003 15.31302 17.91584
    50 80.15470 22.60330 14.68005
    72 53.44268 19.04492 19.90746
    73 64.13566 18.63244 13.79634
    87 79.79930 21.90065 18.58642
    94 51.71569 12.93184 19.87585