Search code examples
sql-serverunicodesql-server-2008-r2cjk

Finding index of Korean Alphabet (any Unicode char) in Korean Word (any Unicode Word) in SQL server


I have a requirement to search for people by name. Here peoples' names can be in English, Korean, or Chinese languages. For this I used Like condition to search on the basis of Name as below:

select * from [MyTable] where Name like N'%t%'

The above statement is giving all the users which contains letter t. But this is not working with Korean or Chinese languages. Like if I search with Korean letter then it is supposed to give all the names which contains this letter like **정수연, 재훈아이팟, 정원혁 테스트 7**. I have tried the following ways but it's giving zero results

select * from [MyTable] where Name like N'%ㅈ%' - No Results
select PATINDEX(N'%ㅈ%',N'정수연(Mohan)') - giving value as ZERO
select Charindex(N'ㅈ',N'정수연') - giving value as ZERO

Is there any way to find the letters of the alphabets of other languages in SQL server?

I know how to find alphabet existence in other language in C# words by using encoding techniques but not in SQL server. Please help me in this regard.

Thanks in advance.

EDIT for C# code

public static string DecomposeSyllabels(string unicodeString) {
      try {
        //Consonant consonant only used
        string[] JLT = { "ㄱ", "ㄲ", "ㄴ", "ㄷ", "ㄸ", "ㄹ", "ㅁ", "ㅂ", "ㅃ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅉ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ" };

        // Only used a collection of neutral
        string[] JVT = { "ㅏ", "ㅐ", "ㅑ", "ㅒ", "ㅓ", "ㅔ", "ㅕ", "ㅖ", "ㅗ", "ㅘ", "ㅙ", "ㅚ", "ㅛ", "ㅜ", "ㅝ", "ㅞ", "ㅟ", "ㅠ", "ㅡ", "ㅢ", "ㅣ" };

        // Initial and coda consonants used in
        string[] JTT = { "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ" };

        double SBase = 0xAC00;
        long SCount = 11172;
        int TCount = 28;
        int NCount = 588;
        string syllables = string.Empty;

        foreach (char c in unicodeString) {
          double SIndex = (int)c - SBase;
          if (0 > SIndex || SIndex >= SCount) {
            syllables = syllables + c;
            continue;
          }

          int LIndex = (int)Math.Floor(SIndex / NCount);
          int VIndex = (int)(Math.Floor((SIndex % NCount) / TCount));
          int TIndex = (int)(SIndex % TCount);
          syllables = syllables + (JLT[LIndex] + JVT[VIndex] + JTT[TIndex]);
        }

        return syllables;
      }
      catch {
        return unicodeString;
      }
    }

Solution

  • You'll have to decompose the Korean syllables and store these into a separate column in your SQL db (like ㅈㅓㅇㅅㅜㅇㅕㄴ for 정수연). I'd suggest you write a small custom app that parses your db, decomposes all Korean syllables, and saves the results into a separate column.

    EDIT

    Here's some Python code that will decompose Hangul syllables:

    #!/usr/local/bin/python
    # -*- coding: utf8 -*-
    import codecs, sys, os, math
    
    JLT="ㄱ,ㄲ,ㄴ,ㄷ,ㄸ,ㄹ,ㅁ,ㅂ,ㅃ,ㅅ,ㅆ,ㅇ,ㅈ,ㅉ,ㅊ,ㅋ,ㅌ,ㅍ,ㅎ".split(",")
    JTT=",ㄱ,ㄲ,ㄱㅅ,ㄴ,ㄴㅈ,ㄴㅎ,ㄷ,ㄹ,ㄹㄱ,ㄹㅁ,ㄹㅂ,ㄹㅅ,ㄹㅌ,ㄹㅍ,ㄹㅎ,ㅁ,ㅂ,ㅂㅅ,ㅅ,ㅆ,ㅇ,ㅈ,ㅊ,ㅋ,ㅌ,ㅍ,ㅎ".split(",")
    JVT="ㅏ,ㅐ,ㅑ,ㅒ,ㅓ,ㅔ,ㅕ,ㅖ,ㅗ,ㅘ,ㅙ,ㅚ,ㅛ,ㅜ,ㅝ,ㅞ,ㅟ,ㅠ,ㅡ,ㅢ,ㅣ".split(",")
    SBase=0xAC00
    SCount=11172
    TCount=28
    NCount=588
    
    def HangulName(a):
     b=a.decode('utf8')
     sound=''
     for i in b:
      cp=ord(i)
      SIndex = cp - SBase
      if (0 > SIndex or SIndex >= SCount):
        # "Not a Hangul Syllable"
        pass
      LIndex = int(math.floor(SIndex / NCount))
      VIndex = int(math.floor((SIndex % NCount) / TCount))
      TIndex = int(SIndex % TCount)
      sound=sound+(JLT[LIndex] + JVT[VIndex] + JTT[TIndex]).lower()
     return sound
    
    print HangulName("정수연")
    

    dda$ python test.py
    ㅈㅓㅇㅅㅜㅇㅕㄴ