Tags: python, regex, validation, character, alphabet

Regex to match all Hangul (Korean) characters and syllable blocks


I'm trying to validate user input (in Python) to check whether the right language is being used, Korean in this case. Let's take the Korean word for email address: 이메일 주소

I can check each character like so:

import unicodedata as ud
for ch in '이메일 주소':
    if 'HANGUL' in ud.name(ch):
        print("Yep, that's a Korean character.")

But that seems highly inefficient, especially for longer texts. Of course, I could create a static dictionary containing all Korean syllable blocks, but that dictionary would hold some 11,000 characters and, again, checking against it would be inefficient. I also need a solution for Japanese and Chinese, which use even more characters.

Therefore, I'd like to use a Regex pattern covering all Unicode characters for Hangul syllable blocks. But I have no clue if there is a range for that or where to find it.

As an example, this regex pattern covers all Latin-based characters, including brackets and other commonly used symbols:

import re
LATIN_CHARACTERS = re.compile(r'[\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]')

Can somebody translate this regex to match Hangul syllable blocks? Or can you show me a table or reference where I can look up such ranges myself?

A pattern to match Chinese and Japanese would also be very helpful. Or one regex to match all CJK characters at once. I wouldn't need to distinguish between Japanese and Korean.

Here's a Python library for that task, but it works with incredibly huge dictionaries: https://github.com/EliFinkelshteyn/alphabet-detector I cannot imagine that being efficient for large texts and lots of user input.

Thanks!


Solution

  • You are aware of how Unicode is broken into blocks, and how each block represents a contiguous range of code points? That means there is a much more efficient solution than a regular expression.

    There is a single block for Hangul Jamo, with additional characters in the CJK block, a compatibility block, the Hangul Syllables block, etc.

    The most efficient way is to check whether each character falls within an acceptable range, using simple comparisons. You could almost certainly speed this up with a C extension.

    For example, if I were just checking the Hangul Jamo block (insufficient on its own, but a simple starting place), I would check each character in a string with the following code:

    def is_hangul_character(char):
        '''Check if a character is in the Hangul Jamo block (U+1100-U+11FF).'''

        value = ord(char)
        return 0x1100 <= value <= 0x11FF


    def is_hangul(string):
        '''Check if all characters are in the Hangul Jamo block.'''

        return all(is_hangul_character(i) for i in string)

    It would be easy to extend this for the eight or so blocks that contain Hangul characters, as sketched below. No table lookups, no regex compilation. Just fast range checks based on the block of the Unicode character.
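
    As a minimal sketch of that extension (the block ranges are taken from the Unicode charts; the halfwidth range is approximate and the helper name is my own):

    # Hangul-related Unicode blocks as (start, end) code points, inclusive
    HANGUL_BLOCKS = (
        (0x1100, 0x11FF),  # Hangul Jamo
        (0x3130, 0x318F),  # Hangul Compatibility Jamo
        (0xA960, 0xA97F),  # Hangul Jamo Extended-A
        (0xAC00, 0xD7AF),  # Hangul Syllables
        (0xD7B0, 0xD7FF),  # Hangul Jamo Extended-B
        (0xFFA0, 0xFFDC),  # Halfwidth Hangul variants (approximate)
    )


    def is_hangul_text(string):
        '''Check if every character falls in one of the Hangul blocks.'''

        return all(
            any(start <= ord(ch) <= end for start, end in HANGUL_BLOCKS)
            for ch in string
        )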

    In C this would be just as easy (and buys you a significant performance boost, matching a fully optimized library with little work):

    #include <stddef.h>  /* size_t */
    #include <uchar.h>   /* char32_t */

    // Return 0 if a character is in the Hangul Jamo block, -1 otherwise
    int is_hangul_character(char32_t c)
    {
        if (c >= 0x1100 && c <= 0x11FF) {
            return 0;
        }
        return -1;
    }
    
    
    // Return 0 if all characters are in Hangul Jamo block, -1 otherwise
    int is_hangul(const char32_t* string, size_t length)
    {
        size_t i;
        for (i = 0; i < length; ++i) {
            if (is_hangul_character(string[i]) < 0) {
                return -1;
            }
        }
        return 0;
    }
    

    Edit: A cursory glance at the CPython source shows that the unicodedata module uses this exact approach internally, i.e. it is efficient despite being easy to implement on your own. Rolling your own is still worthwhile, since you don't have to allocate any intermediate strings or do superfluous string comparisons (which are likely the primary cost of the unicodedata-based check).
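
    For instance, a quick sanity check of the range-based approach against the name lookup from the question (is_hangul_text is the multi-block helper sketched above; is_hangul_by_name is just a named version of the question's loop):

    import unicodedata as ud


    def is_hangul_by_name(string):
        '''Reference check: one Unicode name lookup per character.'''

        return all('HANGUL' in ud.name(ch, '') for ch in string)


    print(is_hangul_text('이메일'))     # True  (Hangul Syllables block)
    print(is_hangul_by_name('이메일'))  # True
    print(is_hangul_text('email'))      # False
    print(is_hangul_by_name('email'))   # False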