javascript regex validation unicode

How to use a regular expression to validate Chinese input?


I need to treat the following kind of Chinese input as invalid in client-side validation:

The input is invalid when English characters, Chinese characters, and spaces, in any mix, have a total length >= 10.

For example, "你的a你的a你的a你" or "你的 你的 你的 你" (length 10) is invalid, but "你的a你的a你的a" (length 9) is OK.

I am using JavaScript for client-side validation and Java for the server side, so ideally the same regular expression would work in both.

Can anyone give some hints on how to express these rules as a regular expression?


Solution

  • From "What's the complete range for Chinese characters in Unicode?", the CJK Unicode ranges are:

    Block                                   Range       Comment
    --------------------------------------- ----------- ----------------------------------------------------
    CJK Unified Ideographs                  4E00-9FFF   Common
    CJK Unified Ideographs Extension A      3400-4DBF   Rare
    CJK Unified Ideographs Extension B      20000-2A6DF Rare, historic
    CJK Unified Ideographs Extension C      2A700-2B73F Rare, historic
    CJK Unified Ideographs Extension D      2B740-2B81F Uncommon, some in current use
    CJK Unified Ideographs Extension E      2B820-2CEAF Rare, historic
    CJK Compatibility Ideographs            F900-FAFF   Duplicates, unifiable variants, corporate characters
    CJK Compatibility Ideographs Supplement 2F800-2FA1F Unifiable variants
    CJK Symbols and Punctuation             3000-303F
    

    You probably want to allow code points from the Unicode blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A.

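    This regex matches a string of 0 to 9 characters, each of which is a space, an ASCII letter (A-Z or a-z), a code point in CJK Symbols and Punctuation (U+3000-U+303F, which includes the ideographic space U+3000), or a code point in those 2 CJK blocks. Any other character, or a 10th character, makes the input invalid.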

    /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/
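
    As a quick check, the pattern above can be run against the three sample inputs from the question. This is only a sketch: `ALLOWED_UP_TO_9` and `isValidInput` are hypothetical names introduced for illustration.

```javascript
// Same pattern as above: 0 to 9 characters drawn from the allowed set.
const ALLOWED_UP_TO_9 = /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/;

// Hypothetical helper name, used only in this illustration.
function isValidInput(text) {
    return ALLOWED_UP_TO_9.test(text);
}

// The three samples from the question.
const samples = [
    "你的a你的a你的a你", // length 10, expected invalid
    "你的 你的 你的 你", // length 10 (with spaces), expected invalid
    "你的a你的a你的a",   // length 9, expected valid
];

for (const s of samples) {
    console.log(s.length, isValidInput(s));
}
// prints:
// 10 false
// 10 false
// 9 true
```

    For the server side, the same character class should also be usable with Java's `String.matches`, which requires the whole input to match, so the `^` and `$` anchors become optional there.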
    

    The ideographs in these blocks are listed in the Unicode code charts.

    However, you may as well add more blocks.
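
    One possible way to add a supplementary-plane block such as Extension B is sketched below. It assumes a runtime that supports the ES2015 `u` flag and `\u{...}` escapes; `ALLOWED_WITH_EXT_B` and `isValidWithExtensionB` are hypothetical names. Without the `u` flag, code points above U+FFFF are seen as two UTF-16 surrogates, so they cannot be written as a range in the character class.

```javascript
// Assumption: the runtime supports the ES2015 `u` flag and \u{...} escapes.
// With `u`, the {0,9} quantifier counts code points, so one Extension B
// ideograph counts as 1 even though it occupies 2 UTF-16 code units.
const ALLOWED_WITH_EXT_B =
    /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF\u{20000}-\u{2A6DF}]{0,9}$/u;

// Hypothetical helper name, used only in this sketch.
function isValidWithExtensionB(text) {
    return ALLOWED_WITH_EXT_B.test(text);
}

const rare = "\u{20000}"; // first code point of CJK Unified Ideographs Extension B

console.log(rare.length);                            // 2 (UTF-16 code units)
console.log(isValidWithExtensionB(rare.repeat(9)));  // true  (9 code points)
console.log(isValidWithExtensionB(rare.repeat(10))); // false (10 code points)
```

    Note that `String.prototype.length` still reports UTF-16 code units, so a separate length check based on `.length` would count each of these ideographs as 2.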


    Code:

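    // Despite its name, this returns true only for at most 9 allowed characters
    // (spaces, A-Z, a-z, U+3000-U+303F, and the two CJK blocks above).
    function has10OrLessCJK(text) {
        return /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/.test(text);
    }
    
    // Show the validation result for the current input value.
    function checkValidation(value) {
        var valid = document.getElementById("valid");
        valid.innerText = has10OrLessCJK(value) ? "Valid" : "Invalid";
    }
    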
    <input type="text" 
           style="width:100%"
           oninput="checkValidation(this.value)"
           value="你的a你的a你的a">
    
    <div id="valid">
        Valid
    </div>