Tags: java, unicode, character, unicode-normalization

Breaking down a Hangul syllable into letters (jamo)


I'm working on a program that deals with Korean sentences, and I need a way to break a syllable (or block) down into its letters. For those who don't know Hangul, a syllable is composed of 2-4 letters (jamo), which gives thousands of different combinations. What I'd like to do is break those syllables down into the letters that form them.

I was able to get the first letter by checking which Unicode range the syllable falls in, since all syllables that start with a given letter sit in one contiguous range. However, I'm at a loss for how to find the rest of the letters.

This is a table containing the Unicode values for Hangul syllables: http://jrgraphix.net/r/Unicode/AC00-D7AF


Solution

  • Hangul syllable decomposition (e.g. 퓛 → ᄑ + ᅱ + ᆶ) can be done in Java with the java.text.Normalizer class:

    import java.text.Normalizer;
    String s = Normalizer.normalize("\uD4DB", Normalizer.Form.NFD);  // s is "퓛" (U+1111 U+1171 U+11B6)
    

    The algorithm for Hangul decomposition is also given in Section 3.12 of the Unicode Standard (from page 142); and since normalisation also affects other, non-Hangul characters, you should familiarise yourself with the general principles and forms of Unicode normalisation in UAX #15.
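
If you only need to split Hangul syllables and want to leave every other character untouched, the arithmetic from the Unicode Standard can be written out directly. The following is a minimal sketch of that idea, not a replacement for Normalizer. The class name HangulJamo and the method decompose are hypothetical names chosen for illustration; the constants are the ones the standard defines for the Hangul syllable block.

```java
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative and not part of any library.
final class HangulJamo {
    // Constants from the Hangul syllable decomposition algorithm in the Unicode Standard.
    private static final int S_BASE = 0xAC00; // first precomposed syllable
    private static final int L_BASE = 0x1100; // first leading consonant
    private static final int V_BASE = 0x1161; // first vowel
    private static final int T_BASE = 0x11A7; // one before the first trailing consonant
    private static final int V_COUNT = 21;
    private static final int T_COUNT = 28;    // includes the "no trailing consonant" case
    private static final int N_COUNT = V_COUNT * T_COUNT; // 588 syllables per leading consonant
    private static final int S_COUNT = 19 * N_COUNT;      // 11172 syllables in total

    /** Returns the 2 or 3 conjoining jamo of a precomposed syllable, or the input unchanged. */
    static String decompose(char syllable) {
        int sIndex = syllable - S_BASE;
        if (sIndex < 0 || sIndex >= S_COUNT) {
            return String.valueOf(syllable); // not a precomposed Hangul syllable
        }
        char lead = (char) (L_BASE + sIndex / N_COUNT);
        char vowel = (char) (V_BASE + (sIndex % N_COUNT) / T_COUNT);
        int tIndex = sIndex % T_COUNT;
        StringBuilder out = new StringBuilder(3).append(lead).append(vowel);
        if (tIndex != 0) {
            out.append((char) (T_BASE + tIndex)); // trailing consonant is optional
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String viaArithmetic = decompose('\uD4DB');
        String viaNormalizer = Normalizer.normalize("\uD4DB", Normalizer.Form.NFD);
        // Both routes should yield the same conjoining jamo sequence.
        System.out.println(viaArithmetic.equals(viaNormalizer));
    }
}
```

Note that both this sketch and NFD produce conjoining jamo from the U+1100 block, not the compatibility jamo at U+3131 and up that keyboards and fonts usually show as standalone letters. Compound letters such as ᆶ also stay as a single code point, so a "2-4 letters" breakdown would need an additional mapping step.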