I have a data frame with 32 million rows. Each row is an account, keyed by an account number, with around 120 columns of mostly numeric and date values.
What would be a good way to check all the columns for outliers/errors/wrong inputs efficiently?
For example, I have a column with House Value. I could plot it and look for spikes, but generating a plot for that many points takes a long time.
If you are interested in doing this with a multidimensional measure, you can use the Mahalanobis distance (M-dist). M-dist measures how far a point P lies from the mean of a distribution D, taking the covariance between variables into account. You can compute it for every row with the following code:
library(tidyverse)

num_data <- data %>% select_if(is.numeric)  # keep only the numeric columns
m_dist <- mahalanobis(num_data, center = colMeans(num_data), cov = cov(num_data))
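The distances still need a cutoff before they flag anything. A common rule of thumb, which is my addition rather than part of the answer above, is that squared Mahalanobis distances of multivariate-normal data follow a chi-squared distribution with degrees of freedom equal to the number of columns, so you can flag rows beyond a high quantile. A minimal sketch, reusing num_data and m_dist from the code above:

# mahalanobis() returns *squared* distances; under approximate normality these
# are chi-squared distributed with df = number of numeric columns
# (an assumption, not a guarantee for messy account data)
cutoff <- qchisq(0.999, df = ncol(num_data))
suspect_rows <- which(m_dist > cutoff)  # row indices worth a manual look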
If you want to treat each column independently of all the others, then you can use:
library(dplyr)
library(tidyr)
library(purrr)
outlierremoval <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%                  # keep only the numeric columns
    map(~ .x[!.x %in% boxplot.stats(.x)$out])  # drop each column's boxplot outliers
    # not clear whether the output should be a list or a data.frame;
    # if the latter, the columns could end up with different lengths,
    # so we may need cbind.fill to pad them with NA:
    # %>% { do.call(rowr::cbind.fill, c(., list(fill = NA))) }
}
outlierremoval(Clean_Data)
This last one comes from: How to get outliers for all the columns in a dataframe in r
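If you would rather keep every row and just mark the suspect values, flagging sidesteps the unequal-column-length problem mentioned in the comments above. A minimal sketch along the same lines (outlierflags is my own name for it, and it reuses the boxplot.stats rule from the function above):

outlierflags <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%
    map_dfc(~ .x %in% boxplot.stats(.x)$out)  # TRUE wherever the value is a boxplot outlier
}

flag_table <- outlierflags(Clean_Data)
colSums(flag_table)  # per-column counts of suspect values, cheap to scan across 120 columns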