I have two database a
and b
with titles of scientific papers in them. I want to merge those databases to one single database c
.
a
contains titles which are not in b
and vice versa.a
and b
.First I thought of using a levenstein distance function inside of MySQL to match the same titles in both databases, but looking at millions of rows I don't know if this would perform well enough. Then I thought of a fulltext search to match the titles but as far as I know fulltext searches do not match common words so the matchings would not perform well on similar titles which are in fact different.
I do not need a 100 % in the matching procedure as a result. But I want to have the rate as high as possible. Any advice?
One idea that comes to mind is to create a search column containing the text without any punctuation and in lower case, and to compare that.
If you are using some other language or platform in connection with mySQL, it might be easiest to do the normalization there - I can't think of a native mySQL function to strip punctuation and such. It's surely possible but maybe only using a hellishly complex set of REPLACE()
calls.