
Unexpected results from Metaphone algorithm


I am using phonetic matching to compare words in Java. I tried Soundex first, but it is too crude, so I switched to Metaphone, which works better. However, when I tested it rigorously, I found some odd behaviour. Is this how Metaphone is supposed to work, or am I using it the wrong way? Consider the following example:

Metaphone meta = new Metaphone();
if (meta.isMetaphoneEqual("cricket","criket")) System.out.println("Match 1");
if (meta.isMetaphoneEqual("cricket","criketgame")) System.out.println("Match 2");

This prints:

  Match 1
  Match 2

"cricket" does sound like "criket", but why are "cricket" and "criketgame" considered the same? An explanation would be a great help.


Solution

  • Your usage is slightly off. Printing the default maximum code length and the encoded strings shows that the default length is 4, which truncates the code for the longer "criketgame":

    System.out.println(meta.getMaxCodeLen());
    System.out.println(meta.encode("cricket"));
    System.out.println(meta.encode("criket"));
    System.out.println(meta.encode("criketgame"));
    

    Output (note that the code for "criketgame" is truncated from "KRKTKM" to "KRKT", which matches the code for "cricket"):

    4
    KRKT
    KRKT
    KRKT
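
    To make the effect of truncation concrete, here is a minimal sketch that does not use Commons Codec at all. It takes the two full codes shown in this answer ("KRKT" and "KRKTKM") and cuts them to a maximum length before comparing. The class TruncationDemo and its helpers truncate and sameCode are hypothetical names made up for this illustration; they are not part of the library and only mimic the length limit, not the Metaphone encoding itself.

```java
class TruncationDemo {
    // Hypothetical helper: keep at most maxLen characters of a phonetic code.
    static String truncate(String code, int maxLen) {
        return code.length() <= maxLen ? code : code.substring(0, maxLen);
    }

    // Hypothetical helper: compare two codes after applying the same length limit.
    static boolean sameCode(String a, String b, int maxLen) {
        return truncate(a, maxLen).equals(truncate(b, maxLen));
    }

    public static void main(String[] args) {
        String cricket = "KRKT";      // code for "cricket", taken from the output above
        String criketgame = "KRKTKM"; // untruncated code for "criketgame"

        // With a limit of 4, both codes collapse to "KRKT" and compare equal.
        System.out.println(sameCode(cricket, criketgame, 4));
        // With a limit of 8, the trailing "KM" survives and the codes differ.
        System.out.println(sameCode(cricket, criketgame, 8));
    }
}
```

    Any two words whose codes share the same first four characters will compare equal under a limit of 4, no matter how different the remaining characters are.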
    


    Solution: Set the maximum code length to something appropriate for your application and the expected input. For example:

    meta.setMaxCodeLen(8);
    System.out.println(meta.encode("cricket"));
    System.out.println(meta.encode("criket"));
    System.out.println(meta.encode("criketgame"));
    

    This now outputs:

    KRKT
    KRKT
    KRKTKM
    

    And now your original test gives the expected results:

    Metaphone meta = new Metaphone();
    meta.setMaxCodeLen(8);
    System.out.println(meta.isMetaphoneEqual("cricket","criket"));
    System.out.println(meta.isMetaphoneEqual("cricket","criketgame"));
    

    Printing:

    true
    false
    

    As an aside, you may also want to experiment with DoubleMetaphone, which is an improved version of the algorithm.


    By the way, note the caveat from the documentation regarding thread-safety:

    The instance field maxCodeLen is mutable but is not volatile, and accesses are not synchronized. If an instance of the class is shared between threads, the caller needs to ensure that suitable synchronization is used to ensure safe publication of the value between threads, and must not invoke setMaxCodeLen(int) after initial setup.