Tags: r, validation, dataframe, data-cleaning, outliers

How to check for errors/outliers in large data in R?


I have a data frame with 32 million rows and around 120 columns. Each row is an account, and the columns are mostly numbers and dates.

What would be a good way to check all the columns for outliers/errors/wrong inputs efficiently?

For example, I have a column with House Value. I could plot it and look for spikes, but generating a plot for that many points takes a while.


Solution

  • If you are interested in a multidimensional measure, you can use the Mahalanobis distance (M-dist). M-dist measures how far a point is from the mean of the whole distribution, scaled by the covariance of the data, so correlations between columns are taken into account. You can compute it with the following code:

    library(tidyverse)
    # squared M-dist of each row from the column means (handle NAs first)
    m_dist <- data %>% select_if(is.numeric) %>%
      mahalanobis(center = colMeans(.), cov = cov(.))
    
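    Since mahalanobis() returns squared distances, which for roughly multivariate-normal data follow a chi-squared distribution with df equal to the number of columns used, a common follow-up is to flag rows beyond a high quantile. A minimal sketch, assuming the m_dist vector computed above; the 97.5% quantile is just one conventional cutoff:

    # df = number of numeric columns that went into the distance
    num_cols <- data %>% select_if(is.numeric) %>% ncol()
    cutoff <- qchisq(0.975, df = num_cols)
    outlier_rows <- which(m_dist > cutoff)  # row indices worth inspecting
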

    If you are looking to check each column independently of all the other columns, you can use:

    library(dplyr)
    library(tidyr)
    library(purrr)

    outlierremoval <- function(dataframe){
      dataframe %>%
        select_if(is.numeric) %>%                 # keep only the numeric columns
        map(~ .x[!.x %in% boxplot.stats(.x)$out]) # drop each column's boxplot outliers #%>%
        # Not clear whether the output should be a list or a data.frame;
        # if the latter, the columns could end up with different lengths,
        # so something like cbind.fill would be needed:
        # { do.call(rowr::cbind.fill, c(., list(fill = NA))) }
    }

    outlierremoval(Clean_Data)
    

    This last one comes from: How to get outliers for all the columns in a dataframe in r
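
    If you need the result to stay rectangular instead, a minimal sketch (assuming dplyr 1.0+ for across(); outlier_to_na is a hypothetical helper name) that applies the same boxplot.stats rule but replaces outliers with NA, so every column keeps its original length:

    # Hypothetical helper: mark each column's boxplot outliers as NA
    outlier_to_na <- function(dataframe){
      dataframe %>%
        mutate(across(where(is.numeric),
                      ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)))
    }

    outlier_to_na(Clean_Data)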