I'm trying to merge two datasets on Mergers & Acquisitions. They both consist of c.10'000 observations with c.50-100 variables each. One contains information about the actual M&A deal whereas the other one contains info on how a deal was financed.
The problem is that there is no clear and unique identifier. For example, I could use the date that the deal was announced but that wouldn't be unique because on some days 10 deals were announced. Using company names is difficult since they mostly aren't identical in both datasets. For example if in one dataset I find "Ebay", in the other the same company could be called "eBay", "Ebay Inc", or "Ebay, Inc."
I've been working with the Fuzzy Lookup add-on for Excel, as well as concacenating various identifiers that are not unique but in their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.). However I haven't been able to generate as many matches as I would have hoped.
I'd be grateful for any ideas or pointers towards resources that would help me merge the datasets more efficiently.
This is probably a process that requires group tactics applied over several repeated iterations. No single fuzzy match will be able to catch them all the first time. An optimum strategy should narrow the possibilities for manual matching. Once you match them, exclude them from further fuzzy matching. Go through all records using various tactics to whittle down unmatched entries to as few as possible.
As for the tactics, you mention date and that there could be 10 mergers that day. Now you just manually match up just those 10. That becomes a manageable chunk.
Using company names is difficult...
Yes, but you can combine date with Levenshtein Distance (or an equivalent algorithm) to rank the possible choices to narrow the options. So names with eBay in them will all appear closer when you rank them by their Levenshtein Distance.
Other text comparison algorithms besides Levenshtein include Gotoh, Jaro, Soundex, Chapman, etc. Some of these techniques are decades old so the chances of finding Excel add-ins are high. There used to be a vibrant open source group working on these solutions on Sourceforge.net.
...their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.).
Watch out for SIC codes. These are never consistent nor accurate. Depending on who was maintaining those codes you may not get very accurate values beyond 4-digit levels. Also the SIC codes themselves have been updated while the companies were under no obligation to update/revise them as needed. Lastly, SIC codes have been replaced by newer NAICS codes, which in-turn had several versions. During each change, they add new industries, such as Social Media companies that did not exist in the old SIC codes. However SIC/NAICS codes can be useful to eliminate duplicates through self matching.
...any ideas or pointers towards resources...
SQL Server's text indexing features have ready-made algorithms for finding matches. Depending on your resources, that may be something to explore.
If this matching process is going to be a routine task, then you can explore specialized ETL products (extract, transform, load) offered through various data integration services. Some of these ensure bi-directional updates. But no matter how sophisticated the solution, it's all based on a few simple tactics as shown above and nearly all of them contain manual overrides.