Search code examples
pythonrpandasdplyrfuzzy-search

Fuzzy match column headers for many data.frames with same number of columns?


I have ~6000 data.frames with the same number of columns, but human-labelled headers, meaning there are things like typos, and occasionally an extra word (e.g. what should be address could appear as street_address).

Note that all the column names are fairly distinct (the closest are first_name and last_name), but all the others have no overlap in words

Is there an established 'best practice' for matching the column headers so that they can be organised into one single dataframe?

What I guess so far is to simply match the columns by the number of characters that match (e.g. street_address would probably match correctly to address since 7 characters match)

Is there a better / more established / more reliable way?

Note: I can use either R (preferably dplyr), or python (e.g. pandas) for this (as well as any other libraries in either language)


Solution

  • Here is a got at it...

    sample data

    df1 <- data.frame( first.name = c( "bobby", "carl" ),
                       last_name = c( "fisscher", "sagan") )
    df2 <- data.frame( lst_name = c("ice","cream"),
                       frst_name = c("ben","jerry") )
    df3 <- data.frame( first_nam = c("bert", "ernie"),
                       last_nam = c("elmo", "oscar"))
    
    
    df1;df2;df3 
    
    # first.name last_name
    # 1      bobby  fisscher
    # 2       carl     sagan
    
    # lst_name frst_name
    # 1      ice       ben
    # 2    cream     jerry
    
    # first_nam last_nam
    # 1      bert     elmo
    # 2     ernie    oscar
    

    code

    library( stringdist )
    library( data.table )
    
    #add all data.frames to a list
    L <- list(df1,df2,df3)
    
    #reorder the df's, based on the stringdistance from 
    #  the columnnames of df_n with those of df1
    data.table::rbindlist(
      lapply( L, function(x) {
        #get stringdistance matrix of colnames
        temp <- stringdistmatrix( names(df1), names(x), useNames = TRUE )
        #get the colname of x that matches the one of df1 closest
        colOrder <- colnames(x)[apply(temp,1,which.min)]
        #reorder x accordingly
        x[, colOrder ]
      }),
      #rowbind, ignoring the columnnames, the order is all that matters
      use.names = FALSE )
    
    #    first.name last_name
    # 1:      bobby  fisscher
    # 2:       carl     sagan
    # 3:        ben       ice
    # 4:      jerry     cream
    # 5:       bert      elmo
    # 6:      ernie     oscar