python unicode nlp ime korean-nlp

Is there a way to programmatically combine Korean unicode into one?


Using a Korean Input Method Editor (IME), it's possible to type 버리 + 어 and it will automatically become 버려.

Is there a way to programmatically do that in Python?

>>> x, y = '버리', '어'
>>> z = '버려'
>>> ord(z[-1])
47140
>>> ord(x[-1]), ord(y)
(47532, 50612)

Is there a way to compute that 47532 + 50612 -> 47140?

Here are some more examples:

가보 + 아 -> 가봐

끝나 + ㄹ -> 끝날


Solution

  • I'm Korean. First, if you type 버리 + 어, it becomes 버리어, not 버려. 버려 is an abbreviation of 버리어 and is not generated automatically. For the same reason, 가보 + 아 does not automatically become 가봐 while typing; it stays 가보아.

    Second, by contrast, 끝나 + ㄹ becomes 끝날 because 나 has no jongseong (종성). Note that one Hangul character is made up of a choseong (초성), a jungseong (중성), and optionally a jongseong. The choseong and jongseong are consonants; the jungseong is a vowel. See more at Wikipedia. So only when the last character has no jongseong while typing (like 나 in 끝나) can it take a consonant such as ㄹ as its jongseong.

    If you want to turn 버리 + 어 into 버려, you need to implement some Korean grammar, in this case the abbreviation of jungseong. For example, ㅣ + ㅓ = ㅕ and ㅗ + ㅏ = ㅘ, as in the examples you provided. 한글 맞춤법, chapter 4, section 5 (I can't find English pages right now) defines abbreviations like this. It's possible, but it's not an easy job, especially for non-Koreans.

    Next, if all you want is to turn 끝나 + ㄹ into 끝날, that can be a relatively easy job, since there are libraries that handle composition and decomposition of choseong, jungseong, and jongseong. For Python, I found hgtk. You can try something like this (an illustration, not practical code):

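    import hgtk  # third-party package: pip install hgtk

    # hgtk functions take one character at a time
    cjj1 = hgtk.letter.decompose('나')  # ('ㄴ', 'ㅏ', '')
    cjj2 = hgtk.letter.decompose('ㄹ')  # ('ㄹ', '', '')
    # Merge only if the syllable has no jongseong and the next input is a bare consonant
    if cjj1[2] == '' and cjj2[1] == '':
        cjj = (cjj1[0], cjj1[1], cjj2[0])  # ('ㄴ', 'ㅏ', 'ㄹ')
        cjj2 = None  # the consonant is now the jongseong
        merged = hgtk.letter.compose(*cjj)  # '날'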
    

    Still, without proper knowledge of Hangul, it will be very hard to get it done.
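
To make the two cases above concrete, here is a rough standard-library-only sketch. It relies on the fact that precomposed Hangul syllables are laid out arithmetically starting at U+AC00, so the "47532 + 50612 -> 47140" computation from the question becomes index arithmetic on choseong, jungseong, and jongseong. The names `decompose`, `compose`, `combine`, and `CONTRACTIONS` are my own (they are not hgtk's API), and the contraction table only covers the two vowel pairs mentioned here, so treat it as an illustration rather than a grammar implementation.

```python
# Precomposed syllables follow: code = 0xAC00 + (cho * 21 + jung) * 28 + jong
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"         # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + none

# Assumption: a deliberately tiny table; real Korean grammar has many more rules.
CONTRACTIONS = {("ㅣ", "ㅓ"): "ㅕ", ("ㅗ", "ㅏ"): "ㅘ"}


def is_syllable(ch):
    """True if ch is a single precomposed Hangul syllable (U+AC00..U+D7A3)."""
    return len(ch) == 1 and 0xAC00 <= ord(ch) <= 0xD7A3


def decompose(syllable):
    """Split one precomposed syllable into (cho, jung, jong) jamo."""
    if not is_syllable(syllable):
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    cho, rest = divmod(ord(syllable) - 0xAC00, 21 * 28)
    jung, jong = divmod(rest, 28)
    return CHO[cho], JUNG[jung], JONG[jong]


def compose(cho, jung, jong=""):
    """Build one precomposed syllable from its jamo."""
    return chr(0xAC00 + (CHO.index(cho) * 21 + JUNG.index(jung)) * 28 + JONG.index(jong))


def combine(stem, ending):
    """Attach ending to stem, merging into the last syllable when a rule applies."""
    if not stem or not ending or not is_syllable(stem[-1]):
        return stem + ending
    cho, jung, jong = decompose(stem[-1])
    if jong == "":
        # Case 1: the ending is a bare consonant that can serve as jongseong.
        if ending in JONG[1:]:
            return stem[:-1] + compose(cho, jung, ending)
        # Case 2: the ending starts with a silent ㅇ and the two vowels contract.
        if is_syllable(ending[0]):
            e_cho, e_jung, e_jong = decompose(ending[0])
            merged = CONTRACTIONS.get((jung, e_jung))
            if e_cho == "ㅇ" and merged:
                return stem[:-1] + compose(cho, merged, e_jong) + ending[1:]
    # No rule applies: plain concatenation, which is what an IME would give.
    return stem + ending


print(combine("버리", "어"))  # 버려
print(combine("가보", "아"))  # 가봐
print(combine("끝나", "ㄹ"))  # 끝날
```

Note that this always contracts when the table has a match, whereas in real usage both 버리어 and 버려 are valid, so whether to contract is a decision for the caller.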