Search code examples
mysqlfull-text-searchcase-insensitivestring-matchingpunctuation

Given a string of words: How to find every word (case insensitive) in a varchar(1000) column in MySQL ignoring punctuation?


I have two database a and b with titles of scientific papers in them. I want to merge those databases to one single database c.

  • It is possible that a contains titles which are not in b and vice versa.
  • It is possible that a title is in both databases a and b.
  • It is possible that the cases of the letters and the punctuation do not match:
    • "This is a Title." VS. "this is a title"
    • "This is - yet another - title." VS. "This is yet another title"
    • "The k-mean algorithm based on bla." VS "The k mean Algorithm based on bla"

First I thought of using a levenstein distance function inside of MySQL to match the same titles in both databases, but looking at millions of rows I don't know if this would perform well enough. Then I thought of a fulltext search to match the titles but as far as I know fulltext searches do not match common words so the matchings would not perform well on similar titles which are in fact different.

I do not need a 100 % in the matching procedure as a result. But I want to have the rate as high as possible. Any advice?


Solution

  • One idea that comes to mind is to create a search column containing the text without any punctuation and in lower case, and to compare that.

    If you are using some other language or platform in connection with mySQL, it might be easiest to do the normalization there - I can't think of a native mySQL function to strip punctuation and such. It's surely possible but maybe only using a hellishly complex set of REPLACE() calls.