Search code examples
rdataframedplyrdata-manipulationoutliers

Outlier Detection on common variable name in R


I've would like to create a column to detect outliers in a large dataset. A sample of the dataset is

Item Cost
Book A 5.0
Book A 3.5
Book A 12.0
Book B 6.0
Book B 8.0
Book C 3.0
Book B 6.0
Book C 3.5
Book A 3.8
Book B 13.0
Book A 5.1
Book B 7.0
Book A 11.5
Book C 3.8
Book A 5.5
Book A 6.5
Book B 13.5
Book A 5.5
Book C 3.5
Book A 1.2
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L, 
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A", 
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12, 
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5, 
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))

I would like to detect the outlier (for e.g. if the cost is 60% higher or lower than majority average of the particular item) and output a column as follows

Item Cost Outlier (Y/N)
Book A 5.0 N
Book A 3.5 N
Book A 12.0 Y
Book B 6.0 N
Book B 8.0 N
Book C 3.0 N
Book B 6.0 N
Book C 3.5 N
Book A 3.8 N
Book B 13.0 Y
Book A 5.1 N
Book B 7.0 N
Book A 11.5 Y
Book C 3.8 N
Book A 5.5 N
Book A 6.5 N
Book B 13.5 Y
Book A 5.5 N
Book C 3.5 N
Book A 1.2 Y

Thanks in advance!


Solution

  • library(dplyr)
    
    df %>%
      summarise(outlier = mean(Cost), 
                offset = outlier * 0.6, 
                higher_value = outlier + offset, 
                lower_value = outlier - offset) %>%
      bind_cols(df) %>%
      transmute(Item, Cost, 
                Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
    
    #     Item Cost Outlier
    #1  Book A  5.0       N
    #2  Book A  3.5       N
    #3  Book A 12.0       Y
    #4  Book B  6.0       N
    #5  Book B  8.0       N
    #6  Book C  3.0       N
    #7  Book B  6.0       N
    #8  Book C  3.5       N
    #9  Book A  3.8       N
    #10 Book B 13.0       Y
    #11 Book A  5.1       N
    #12 Book B  7.0       N
    #13 Book A 11.5       Y
    #14 Book C  3.8       N
    #15 Book A  5.5       N
    #16 Book B  6.5       N
    #17 Book B 13.5       Y
    #18 Book A  5.5       N
    #19 Book C  3.5       N
    #20 Book A  1.2       Y
    

    If you want to find the outlier for each Item separately.

    df %>%
      group_by(Item) %>%
      summarise(outlier = mean(Cost), 
                offset = outlier * 0.6, 
                higher_value = outlier + offset, 
                lower_value = outlier - offset) %>%
      left_join(df, by = 'Item') %>%
      transmute(Item, Cost, 
                Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))