I have a data frame with 32 million rows. Each row is an account, keyed by an account number, with around 120 columns of mostly numeric and date values.
What would be a good way to check all the columns for outliers/errors/wrong inputs efficiently?
For example, I have a column with House Value. I could plot it and look for spikes, but generating a plot for that many points takes a long time.
If you are interested in doing this with a multidimensional measure, you can use the Mahalanobis distance (M-dist). M-dist measures how far a point P lies from the mean of a distribution D, taking the covariance between variables into account. You can compute it for every row with the following code:
library(tidyverse)

num_data <- data %>% select_if(is.numeric)  # keep only the numeric columns
m_dist <- mahalanobis(num_data, center = colMeans(num_data), cov = cov(num_data))
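The distances still need a cutoff before they flag anything. A common rule of thumb, which is my addition rather than part of the answer above, is that squared Mahalanobis distances of multivariate-normal data follow a chi-squared distribution with degrees of freedom equal to the number of columns, so you can flag rows beyond a high quantile. A minimal sketch, reusing num_data and m_dist from the code above:

# mahalanobis() returns *squared* distances; under approximate normality these
# are chi-squared distributed with df = number of numeric columns
# (an assumption, not a guarantee for messy account data)
cutoff <- qchisq(0.999, df = ncol(num_data))
suspect_rows <- which(m_dist > cutoff)  # row indices worth a manual look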
If you want to treat each column independently of all the others, then you can use:
library(dplyr)
library(tidyr)
library(purrr)
outlierremoval <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%                  # keep only the numeric columns
    map(~ .x[!.x %in% boxplot.stats(.x)$out])  # drop each column's boxplot outliers
    # not clear whether the output should be a list or a data.frame;
    # if the latter, the columns could end up with different lengths,
    # so we may need cbind.fill to pad them with NA:
    # %>% { do.call(rowr::cbind.fill, c(., list(fill = NA))) }
}
outlierremoval(Clean_Data)
This last one comes from: How to get outliers for all the columns in a dataframe in r
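If you would rather keep every row and just mark the suspect values, flagging sidesteps the unequal-column-length problem mentioned in the comments above. A minimal sketch along the same lines (outlierflags is my own name for it, and it reuses the boxplot.stats rule from the function above):

outlierflags <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%
    map_dfc(~ .x %in% boxplot.stats(.x)$out)  # TRUE wherever the value is a boxplot outlier
}

flag_table <- outlierflags(Clean_Data)
colSums(flag_table)  # per-column counts of suspect values, cheap to scan across 120 columns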