Tags: python, python-2.7, machine-learning, artificial-intelligence, cluster-analysis

Nested (multi-level) clustering of a data table: first by similar name, then by price


How can I cluster strings that have similar names (like McDonald and Mc DOnald's) in a dataset and then, where the names are the same (for example, two entries both named sam), cluster again by value or price? For example, consider the following data table:

name      price
ram         200
shyam       150
ram12        59
gita         45
ram 2        45
g11ita       23
john2        32
john          7
jonh21        8
jonh         38
ram22         3

Then the grouping should be:

ram         200

ram12        59
ram 2        45

ram22         3

john2        32
jonh         38

john          7
jonh21        8

gita         45
g11ita       23

I have tried string clustering with fuzzywuzzy and Levenshtein distance, but it is only able to cluster the strings and cannot cluster the prices. How can I cluster by string first and then, within each group of similar names, cluster by price?


Solution

  • You will need to carefully balance the thresholds for textual similarity and for numerical similarity. There won't be an easy solution, and unless you have a really huge data set, a manual approach may be best.

    Textual similarity of short strings is highly unreliable.

    For example: "dog" and "fog" differ by only a single letter, but are very unlikely to be typos of each other. They have Levenshtein distance 1, the smallest non-zero value! Because of this, if you rely on Levenshtein alone, you will get plenty of false positives. That is okay if you verify them manually, but not for automatic processing.
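To make that concrete, here is a minimal sketch of plain Levenshtein distance in pure Python. The function name `levenshtein` is just an illustrative choice, not the API of fuzzywuzzy or any other library. Note that two unrelated words can score better than a genuine transposition typo such as "john" and "jonh" from the question's table.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Two different words: distance 1
print(levenshtein("dog", "fog"))
# A swapped-letter typo of the same name: distance 2
print(levenshtein("john", "jonh"))
```

So a fixed cut-off such as "distance of at most 1 means same name" would merge dog with fog and still keep john apart from jonh.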

    So at a minimum you'll need to use something that knows about (a) existing words, which are unlikely to be misspelled, (b) common misspellings, (c) phonetic similarity, to estimate how likely it is that a word is misspelled, and (d) keyboard similarity, to estimate how likely it is that a word is mistyped...
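As a starting point for the two-threshold approach described above, here is a rough sketch of nested clustering on the table from the question. It is not a robust solution. Everything in it is an assumption made for illustration: the helper names, the rule that digits and spaces in a name are noise, the use of the standard library's `difflib.SequenceMatcher` in place of fuzzywuzzy, the name threshold of 0.7, and the rule that a price more than 2.5 times the previous one starts a new sub-cluster. It uses none of the dictionary, phonetic, or keyboard knowledge mentioned above, so the result still needs manual checking.

```python
import re
from difflib import SequenceMatcher

rows = [("ram", 200), ("shyam", 150), ("ram12", 59), ("gita", 45),
        ("ram 2", 45), ("g11ita", 23), ("john2", 32), ("john", 7),
        ("jonh21", 8), ("jonh", 38), ("ram22", 3)]


def normalize(name):
    # Assumption: digits, spaces and case carry no meaning in a name.
    return re.sub(r"[^a-z]", "", name.lower())


def cluster_names(rows, threshold=0.7):
    """Greedy first pass: put each row into the first group whose
    representative (normalized first member) is similar enough."""
    groups = []  # list of (representative, members)
    for name, price in rows:
        key = normalize(name)
        for representative, members in groups:
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                members.append((name, price))
                break
        else:
            groups.append((key, [(name, price)]))
    return [members for _, members in groups]


def split_by_price(members, max_ratio=2.5):
    """Second pass: sort one name group by price and start a new
    sub-cluster whenever the price jumps by more than max_ratio."""
    ordered = sorted(members, key=lambda row: row[1])
    clusters = [[ordered[0]]]
    for row in ordered[1:]:
        previous_price = clusters[-1][-1][1]
        if row[1] > max_ratio * previous_price:
            clusters.append([row])
        else:
            clusters[-1].append(row)
    return clusters


nested = [split_by_price(group) for group in cluster_names(rows)]

for name_group in nested:
    for price_cluster in name_group:
        print(price_cluster)
    print("")
```

Both thresholds were picked by hand to suit this tiny table. For example, an absolute price gap would have needed a cut-off between 22 and 24 to keep gita and g11ita together while separating the two john sub-clusters, which shows how fragile the balance is.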