I am looking for a good practical method of tackling metadata normalization between multiple files that have slightly different schema's for a batch ETL job in Talend.
I have a few hundred historical reports (around 25K to 200K records each) with about 100 to 150 columns per excel file. Most of the column names are the same for all the files (98% overlap) however there are subtle evil differences:
Short of writing a specialized application or brute forcing all the files by manually correcting them, are there any good free tools or methods that would provide a diff and correction between file column names in an intelligent or semi-automated fashion?
You could use Talend Open Studio
to achieve that. But I do see one caveat.
In order to make Talend understand your Excel files, you will need to first load it's metadata. The caveat is that you will need to load all metadata by hand (one by one). In the free version of Talend (Open Studio Data), there is no support for dynamic metadata.
Using components like tMap
you can then map your input metadata into your desired output metadata (could be a Excel file or a Database or something else). During this step you can shape your input data into your desired output (fixing / ignoring / transforming it / etc).
There seems to exist a user contributed component that offers support the Excel dynamic metadata. I did not test it, but it worth trying : http://www.talendforge.org/exchange/?eid=663&product=tos&action=view&nav=1,1,1
This can evolve as components are released, updated frequently. My answer is about the status as it is on version 5.3.1