Search code examples
rmerging-data

R: finding mismatched column names before merging


I have a large number of data frames to merge. Each has several hundred columns. I want to identify all mismatching column names before doing this. So far I can generate a list of mismatches, but format is terrible and I can't quite figure out how to tell which data frame they come from.

#create data
df1 <- data.frame("col1" = 3:4, "Age" = c(22,16), "Name" = c("James","Jim"))
df2 <- data.frame("col1" = 3:4, "Age" = c(18,19), "Name" = c("Mike","Mia"))
df3 <- data.frame("mismatch_col_name_1" = 1:2, "Age" = c(21,15), "name" = c("John","Dora"))
df4 <- data.frame("mismatch_col_name_2" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
files <- list(df1, df2, df3, df4)

# find mismatched column names
mismatches <- NULL
for (i in 1:(length(files) - 1)) {
  mismatches <- c(mismatches, setdiff(colnames(files[[i]]), colnames(files[[i+1]])))
}
mismatches <- c(mismatches, setdiff(colnames(files[[length(files)]]), colnames(files[[1]])))
print(mismatches)

[1] "col1"                "Name"                "mismatch_col_name_1" "name"               
[5] "mismatch_col_name_2"

Desired output would be something like:

"df3" "mismatch_col_name_1" "name"

"df4" "mismatch_col_name_2" "Name"

Or even df names and column numbers. Interested in any solutions or better ways to do this.


Solution

  • Here's an approach that gets you to a list (in the R sense) containing mismatches for each file. It is predicated on the assumption that you know the "true" set of names against which to compare each file.

    lapply(files, function(x) {
    
        # vector of desired names
        master <- c('col1', 'Age', 'Name')
    
        # use 'match' to compare this df's names to the master. the order of the
        # cols won't matter; if the name in x appears in master, 'match' will return
        # an integer indicating the position of the col with that name in x.
        comparison <- match(names(x),  master)
    
        # if all col names in x appear in master, you get all integers, so: NULL
        if (!any(is.na(comparison))) {
    
            NULL
    
        # if names in x don't appear in master, you get an NA from 'match', so here you
        # create a vector of the names in x that aren't in master. You could also capture
        # their position here if that's helpful.
        } else {
    
            mismatches <- names(x)[which(is.na(comparison))]
    
        }
    
    })
    

    Result:

    [[1]]
    NULL
    
    [[2]]
    NULL
    
    [[3]]
    [1] "mismatch_col_name_1" "name"               
    
    [[4]]
    [1] "mismatch_col_name_2"
    

    There are various ways you could then organize or summarize the contents of this list, but that's mostly a matter of formatting.