I have ~6000 data.frames with the same number of columns, but human-labelled headers, meaning there are things like typos, and occasionally an extra word (e.g. what should be address
could appear as street_address
).
Note that all the column names are fairly distinct (the closest are first_name
and last_name
), but all the others have no overlap in words
Is there an established 'best practice' for matching the column headers so that they can be organised into one single dataframe?
What I guess so far is to simply match the columns by the number of characters that match (e.g. street_address
would probably match correctly to address
since 7 characters match)
Is there a better / more established / more reliable way?
Note: I can use either R (preferably dplyr), or python (e.g. pandas) for this (as well as any other libraries in either language)
Here is a got at it...
sample data
df1 <- data.frame( first.name = c( "bobby", "carl" ),
last_name = c( "fisscher", "sagan") )
df2 <- data.frame( lst_name = c("ice","cream"),
frst_name = c("ben","jerry") )
df3 <- data.frame( first_nam = c("bert", "ernie"),
last_nam = c("elmo", "oscar"))
df1;df2;df3
# first.name last_name
# 1 bobby fisscher
# 2 carl sagan
# lst_name frst_name
# 1 ice ben
# 2 cream jerry
# first_nam last_nam
# 1 bert elmo
# 2 ernie oscar
code
library( stringdist )
library( data.table )
#add all data.frames to a list
L <- list(df1,df2,df3)
#reorder the df's, based on the stringdistance from
# the columnnames of df_n with those of df1
data.table::rbindlist(
lapply( L, function(x) {
#get stringdistance matrix of colnames
temp <- stringdistmatrix( names(df1), names(x), useNames = TRUE )
#get the colname of x that matches the one of df1 closest
colOrder <- colnames(x)[apply(temp,1,which.min)]
#reorder x accordingly
x[, colOrder ]
}),
#rowbind, ignoring the columnnames, the order is all that matters
use.names = FALSE )
# first.name last_name
# 1: bobby fisscher
# 2: carl sagan
# 3: ben ice
# 4: jerry cream
# 5: bert elmo
# 6: ernie oscar