r function bigdata similarity data-cleaning

Data cleaning: Function to find very similar variables

I have some large data, which partly consists of very similar variables. Some variables have missing values (e.g. x3 and x5 in the example below) and some variables are similar, but with different labels (e.g. x2 and x5). In order to clean my data, I want to identify and eventually delete these similar variables. I am trying to write a function, which returns the column names of all similar variable pairs. Here is some exemplifying data:

# Example data

set.seed(222)

N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA

df <- data.frame(x1, x2, x3, x4, x5)

df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)

head(df)

As you can see, x1, x3, and x4 are very similar; and x2 and x5 are very similar as well. My function should print a list, which includes all pairs with the same values in 80% or more of the cases. Here is what I got so far:

# My attempt to write such a function

fun_clean <- function(data, similarity) {

  output <- list()
  data <- data[complete.cases(data), ]

  for(i in 1:ncol(data)) {

    if(i < ncol(data)) {

      for(j in (i + 1):ncol(data)) {

        similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)

        if(similarity_ij >= similarity) {

          output[[length(output) + 1]] <- colnames(data)[c(i, j)]

        }
      }
    }
  }

  output

}

fun_clean(data = df, similarity = 0.8)

I managed to identify the similarity of x1, x3, and x4. The similarity of x2 and x5 (i.e. similar variables with different labels) is not found by my function. Further, my function is very slow. Therefore, I have the following question:

Question: How could I identify all similar variables in a computationally efficient way?

Solution

In order to compare your columns, you need numeric values first. You can extract only the numeric values by using gsub() and then transform to numeric values. After this transformation, you'll be good to go:

df <- apply(df, 2, function(x) as.numeric( gsub("[^0-9]", "", x) ))

Now you can compare all columns by first using combn(5, 2) to get all pairs of columns you want to compare. Then you can use that to compare the columns and calculate the percentage of entries that are equal.

combs <- combn(ncol(df), 2)

res <- apply(combs, 2, function(x){
  sum(df[, x[1]] == df[, x[2]], na.rm = TRUE)/nrow(df)
})

thresh <- 0.8
combs[, res > thresh]
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    2    3
# [2,]    3    4    5    4

So columns 1 & 3, 1 & 4, 2 & 5 and 3 & 4 are equal to each other in more than 80% of the cases.

Note: If one or both of the compared values have an NA, this will be considered as not a match!