I have a dataset which contains pairs of names, it looks like this:
ID; name1; name2
1; Mike Miller; Mike Miler
2; John Doe; Pete McGillen
3; Sara Johnson; Edita Johnson
4; John Lemond-Lee Peter; John LL. Peter
5; Marta Sunz; Martha Sund
6; John Peter; Johanna Petera
7; Joanna Nemzik; Joanna Niemczik
I have some cases, which are labelled. So I check them manually and decide if these are duplicates or not. The manual judgement in these cases would be:
1: Is a duplicate
2: Is not a duplicate
3: Is not a duplicate
4: Is a duplicate
5: Is not a duplicate
6: Is not a duplicate
7: Is a duplicate
(The 7th case is a specific case, because here phonetics come into the game too. However, this is not the main problem, I am ok with ignoring phonetics.)
A first approach would be to calculate the Levenshtein-distance for each pair and mark those as a duplicate, where the Levenshtein-distance is for example less or equal than 2. This would lead to the following output:
1: Levenshtein distance: 2 => duplicate
2: Levenshtein distance: 11 => not a duplicate
3: Levenshtein distance: 4 => not a duplicate
4: Levenshtein distance: 8 => not a duplicate
5: Levenshtein distance: 2 => duplicate
6: Levenshtein distance: 4 => not a duplicate
7: Levenshtein distance: 2 => duplicate
This would be an approach which uses a "fixed" algorithm based on the Levinshtein distance.
Now, I would like to do this task with using a neural network / machine learning:
I do not need the neural network to detect semantic similarity, like "hospital" and "clininc". However, I would like to avoid the Levenshtein-distance, as I would like the ML algorithm to be able to detect "John Lemond-Lee Peter" and "John LL. Peter" as a potential duplicate, also not with a 100% certainty. The Levenshtein distance would lead to a relative high number in this case (8), as there are quite some characters to be added. In a case like "John Peter" and "Johanna Petera" the Levenshtein-distance would lead to a smaller number (4), however this is in fact no duplicate and for this case I would hope that the ML algorithm would be able to detect that this is likely not a duplicate. So I need the ML algorithm to "learn the way I need the duplicates to be checked". With my labelling I would give as an input I would give the ML algorithm the direction, of what I want.
I actually thought that this should be an easy task for a ML algorithm / neural network, but I am not sure.
How can I implement a neural network to compare the pairs of names and identify duplicates without using an explicit distance metric (like the Levenshtein distance, euclidean etc.)?
I thought that it would be possible to convert the strings to numbers and a neural network can work with this and learn to detect duplicates according to my labelling style. So without having to specify a distance metric. I thought about an human: I would give this task to a person and this person would judge and make a decision. This person has no clue about a Levenshtein-distance or any other mathematical concept. So I just want to train the neural network to learn to do what the human is doing. Of course, every human is different and it also depends on my labelling.
(Edit: The ML/neural network solutions I have seen so far (like this) use a metric like levenshtein as a feature input. But as I said I thought it should be possible to teach the neural network the "human judgement" without making use of such a distance measure? Regarding my specific case with having pairs of names: What would the benefit be a of a ML approach using levenshtein distance as a feature? Because it will just detect those pairs of names as a duplicate that have a low levenshtein distance. So I could use a simple algorithm to mark a pair as duplicate if the levenshtein distance between the two names is less than x. Why use a ML instead, what would be the additional benefit?)
I have read carefully whole your question, but still I don't know why you want a neural network for that.
Real, sad answer
Tweak edit distance (more general distance than Levenshtein) by adding some weights - idea: swapping characters that are close on the keyboard is more likely than those that are faraway. So distance between Asa
and Ada
is smaller than Asa
and Ala
.
Case (4) you can cover with regex.
Happy answer If you insist to go with ML solutions, here is the sketch of what I would do if forced: