javascript regex unicode character-properties

Latin Characters check


There are some similar questions out there, but none that are quite the same or that have an answer that works for me.

I need a JavaScript function which validates whether a text field contains only valid Latin characters, so no Cyrillic or Chinese, just Latin; specifically:

Basic Latin (excluding the C0 control characters), Latin-1 (excluding the C1 control characters), Latin Extended A, Latin Extended B and Latin Extended Additional. This set corresponds to Unicode code points U+0020 to U+007E, U+00A0 to U+024F and U+1E00 to U+1EFF.

Some of the answers out there seem to check the first character in the text field but miss out others, so these are no good.

This is what I have tried so far (this doesn't work!):

var value = 'abcdef' // from text field
var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// var re = '\\w+/'; // alternative
if (new RegExp(re).test(value)) {
    result = false;
}

The following sort of works but only for the first character:

//var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// couldn't get the above to work so using the following:
var re = '\\w+';
if (!value.match(re)) {
    message = 'Please enter valid latin characters only';
    $focusField = $this;
}

What is the right way to do this?

I really need code, rather than an explanation, but both would be better.

Thanks


Solution

  • EDIT: Note that the solution given in the accepted answer is incorrect. It is full of false positives and false negatives. The exact code point numbers needed are given at the bottom of this post.

    The example given by the question mistakenly attempts to use Block rather than Script properties!

    You do not want to use Unicode block character properties here; you want to use Unicode script character properties. In other words, you really want Script=Latin and not to try to use Block=Basic_Latin plus Block=Latin_1 plus Block=Latin_1_Supplement plus Block=Latin_Extended_A plus Block=Latin_Extended_Additional.

    Note also that the question neglected other Latin blocks: Block=Latin_Extended_C and Block=Latin_Extended_D.

    Even if you used the correct blocks, you would get 145 false positives that were in those blocks but which were not Latin script characters:

    $ unichars '\P{Script=Latin}' '[\p{Block=Basic_Latin}\p{Block=Latin_1}\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}]' | wc -l
    145
    

    Furthermore, you would miss 403 false negatives that are indeed Latin script characters but which are not in those blocks:

    $ unichars '\p{Script=Latin}' '[^\p{Block=Basic_Latin}\p{Block=Latin_1}\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}]' | wc -l
    403
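
    To make the block/script mismatch concrete, here is a minimal sketch rather than a definitive check. It assumes an engine that supports ES2018 Unicode property escapes under the `u` flag, and the names `questionBlocks`, `inQuestionBlocks` and `isLatinScript` are made up for illustration. The ranges are the ones the question lists, and the two sample code points come from the tables at the bottom of this post.

```javascript
// Ranges the question asked for, hard-coded purely for illustration.
const questionBlocks = [
  [0x0020, 0x007e], // Basic Latin without the C0 controls
  [0x00a0, 0x024f], // Latin-1 (without C1 controls) through Latin Extended-B
  [0x1e00, 0x1eff], // Latin Extended Additional
];

// True when the code point lies inside one of the question's ranges.
const inQuestionBlocks = (cp) =>
  questionBlocks.some(([lo, hi]) => cp >= lo && cp <= hi);

// Assumption: the engine understands \p{Script=...} with the u flag.
const isLatinScript = (cp) =>
  /^\p{Script=Latin}$/u.test(String.fromCodePoint(cp));

// U+00D7 MULTIPLICATION SIGN: inside the ranges, but Script=Common.
const falsePositive = 0x00d7;
// U+212A KELVIN SIGN: Script=Latin, but outside every range above.
const falseNegative = 0x212a;

console.log(inQuestionBlocks(falsePositive), isLatinScript(falsePositive)); // true false
console.log(inQuestionBlocks(falseNegative), isLatinScript(falseNegative)); // false true
```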
    

    You virtually never want to use Blocks; you want to use Scripts. That’s why Level 1 conformance of UTS#18 requires in Requirement 1.2 that the Script character property be supported, but says nothing of the Block property until Requirement 2.7: Full Properties.

    See UTS#18 Annex A, Character Blocks, for more pitfalls that come of using Blocks instead of Scripts.

    Removing the code points that lie outside the Basic Multilingual Plane due to the Javascript bug that makes it impossible to specify these by ranges, we are left with this set of insanely unmaintainable garbledy-gook needed to fish out all Unicode v6.2 code points having the Latin, Common, or Inherited script character property:

    [\u0000-\u0040][\u0041-\u005A][\u005B-\u0060][\u0061-\u007A][\u007B-\u00A9]\u00AA[\u00AB-\u00B9]\u00BA[\u00BB-\u00BF][\u00C0-\u00D6]\u00D7[\u00D8-\u00F6]
    \u00F7[\u00F8-\u02B8][\u02B9-\u02DF][\u02E0-\u02E4][\u02E5-\u02E9][\u02EC-\u02FF][\u0300-\u036F]\u0374\u037E\u0385\u0387[\u0485-\u0486]\u0589\u060C
    \u061B\u061F\u0640[\u064B-\u0655][\u0660-\u0669]\u0670\u06DD[\u0951-\u0952][\u0964-\u0965]\u0E3F[\u0FD5-\u0FD8]\u10FB[\u16EB-\u16ED][\u1735-\u1736][\u1802-\u1803]
    \u1805[\u1CD0-\u1CD2]\u1CD3[\u1CD4-\u1CE0]\u1CE1[\u1CE2-\u1CE8][\u1CE9-\u1CEC]\u1CED[\u1CEE-\u1CF3]\u1CF4[\u1CF5-\u1CF6][\u1D00-\u1D25][\u1D2C-\u1D5C]
    [\u1D62-\u1D65][\u1D6B-\u1D77][\u1D79-\u1DBE][\u1DC0-\u1DE6][\u1DFC-\u1DFF][\u1E00-\u1EFF][\u2000-\u200B][\u200C-\u200D][\u200E-\u2064][\u206A-\u2070]
    \u2071[\u2074-\u207E]\u207F[\u2080-\u208E][\u2090-\u209C][\u20A0-\u20BA][\u20D0-\u20F0][\u2100-\u2125][\u2127-\u2129][\u212A-\u212B][\u212C-\u2131]
    \u2132[\u2133-\u214D]\u214E[\u214F-\u215F][\u2160-\u2188]\u2189[\u2190-\u23F3][\u2400-\u2426][\u2440-\u244A][\u2460-\u26FF][\u2701-\u27FF][\u2900-\u2B4C]
    [\u2B50-\u2B59][\u2C60-\u2C7F][\u2E00-\u2E3B][\u2FF0-\u2FFB][\u3000-\u3004]\u3006[\u3008-\u3020][\u302A-\u302D][\u3030-\u3037][\u303C-\u303F]
    [\u3099-\u309A][\u309B-\u309C]\u30A0[\u30FB-\u30FC][\u3190-\u319F][\u31C0-\u31E3][\u3220-\u325F][\u327F-\u32CF][\u3358-\u33FF][\u4DC0-\u4DFF][\uA700-\uA721]
    [\uA722-\uA787][\uA788-\uA78A][\uA78B-\uA78E][\uA790-\uA793][\uA7A0-\uA7AA][\uA7F8-\uA7FF][\uA830-\uA839][\uFB00-\uFB06][\uFD3E-\uFD3F]\uFDFD
    [\uFE00-\uFE0F][\uFE10-\uFE19][\uFE20-\uFE26][\uFE30-\uFE52][\uFE54-\uFE66][\uFE68-\uFE6B]\uFEFF[\uFF01-\uFF20][\uFF21-\uFF3A][\uFF3B-\uFF40][\uFF41-\uFF5A]
    [\uFF5B-\uFF65]\uFF70[\uFF9E-\uFF9F][\uFFE0-\uFFE6][\uFFE8-\uFFEE][\uFFF9-\uFFFD]
    

    Personally, I would fire anyone who attempted to use that sort of nonsense.

    Furthermore, the 3,225 code points that you miss because of the Javascript bug in handling full Unicode are the following:

    10100-10102 10107-10133 10137-1013F 10190-1019B 101D0-101FC 101FD
    1D000-1D0F5 1D100-1D126 1D129-1D166 1D167-1D169 1D16A-1D17A 1D17B-1D182
    1D183-1D184 1D185-1D18B 1D18C-1D1A9 1D1AA-1D1AD 1D1AE-1D1DD 1D300-1D356
    1D360-1D371 1D400-1D454 1D456-1D49C 1D49E-1D49F 1D4A2 1D4A5-1D4A6
    1D4A9-1D4AC 1D4AE-1D4B9 1D4BB 1D4BD-1D4C3 1D4C5-1D505 1D507-1D50A
    1D50D-1D514 1D516-1D51C 1D51E-1D539 1D53B-1D53E 1D540-1D544 1D546
    1D54A-1D550 1D552-1D6A5 1D6A8-1D7CB 1D7CE-1D7FF 1F000-1F02B 1F030-1F093
    1F0A0-1F0AE 1F0B1-1F0BE 1F0C1-1F0CF 1F0D1-1F0DF 1F100-1F10A 1F110-1F12E
    1F130-1F16B 1F170-1F19A 1F1E6-1F1FF 1F201-1F202 1F210-1F23A 1F240-1F248
    1F250-1F251 1F300-1F320 1F330-1F335 1F337-1F37C 1F380-1F393 1F3A0-1F3C4
    1F3C6-1F3CA 1F3E0-1F3F0 1F400-1F43E 1F440 1F442-1F4F7 1F4F9-1F4FC
    1F500-1F53D 1F540-1F543 1F550-1F567 1F5FB-1F640 1F645-1F64F 1F680-1F6C5
    1F700-1F773 E0001 E0020-E007F E0100-E01EF
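
    Code points above U+FFFF are stored in a Javascript string as two UTF-16 code units, which is why a plain `\uXXXX` range cannot reach them. One way around that, sketched here under the assumption of an ES2015-or-later engine, is to walk the string by code point and compare numbers against a table. The helper names `inRanges` and `codePoints` are hypothetical, and `astralRanges` holds only three of the ranges listed above, not the whole set.

```javascript
// Three of the supplementary-plane ranges listed above (not the full table).
const astralRanges = [
  [0x1d400, 0x1d454],
  [0x1f300, 0x1f320],
  [0xe0020, 0xe007f],
];

// True when cp falls inside any [lo, hi] pair.
function inRanges(cp, ranges) {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Array.from uses the string iterator, which yields whole code points,
// so a surrogate pair stays together as one element.
function codePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// U+1D400 MATHEMATICAL BOLD CAPITAL A, written as its surrogate pair.
const sample = "\ud835\udc00";

console.log(sample.length); // 2 (UTF-16 code units, not characters)
console.log(codePoints(sample).map((cp) => cp.toString(16)).join(" ")); // 1d400
console.log(codePoints(sample).every((cp) => inRanges(cp, astralRanges))); // true
```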
    

    The correct way to do all this is included below.

    If you are going to be playing around with Unicode character properties, it is tantamount to hopeless to hardcode code-point numbers like this. What you really want is to be able to say something like:

    [^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]
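
    For what it is worth, ECMAScript 2018 later added `\p{…}` property escapes under the `u` flag, so an engine that implements them can take almost exactly this pattern. The following is a sketch under that assumption only; it will throw a syntax error on older engines, and `isLatinText` is a made-up name.

```javascript
// Matches any single code point that is NOT Latin, Common, or Inherited.
// Assumption: the engine supports ES2018 Unicode property escapes.
const forbidden = /[^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]/u;

// A value passes when it contains no forbidden code point at all.
function isLatinText(value) {
  return !forbidden.test(value);
}

console.log(isLatinText("Crème brûlée, 42!")); // true
console.log(isLatinText("Привет"));            // false (Cyrillic)
console.log(isLatinText("abc 中文"));           // false (Han)
```

    Because the pattern looks for a single offending character anywhere in the string, it checks the whole field rather than just the first character.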
    

    However, Javascript regexes are still completely antemillennial in this regard, and are far from complying with Unicode Technical Standard #18: Unicode Regular Expressions, even at its most basic compliance level, Level 1:

    Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.

    Because even the most rudimentary compliance level for Unicode regular expressions is still far beyond Javascript’s capabilities, I strongly recommend running whatever Unicode-aware regexes you need on the server in some language that actually supports them.

    However, in the event that this is not practical, a sanity-saving workaround is the Javascript XRegExp plugin, which provides a saner regex library that also allows for access to certain essential character properties such as you are attempting to use.

    As of v2.0, the “XRegExp All” add-on supports all these:

    • XRegExp 2.0.0
    • Unicode Base 1.0.0
    • Unicode Categories 1.2.0
    • Unicode Scripts 1.2.0
    • Unicode Blocks 1.2.0
    • Unicode Properties 1.0.0
    • XRegExp.matchRecursive 0.2.0
    • XRegExp.build 0.1.0
    • Prototypes 1.0.0

    Which means that once you have it loaded, you will be able to get at the properties you need this way:

    XRegExp("[^\\p{Latin}\\p{Common}\\p{Inherited}]");
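
    A sketch of how that pattern might be wired into the validation from the question. The names `makeLatinValidator`, `nativeCompile` and `isLatin` are hypothetical. The stand-in compiler exists only so the sketch runs where XRegExp is not loaded, and it assumes an engine with ES2018 property escapes; with the library and its script add-on present, `XRegExp` itself is used.

```javascript
// `compile` turns a pattern string into a RegExp. With XRegExp and its
// Unicode Scripts add-on loaded, pass XRegExp itself.
function makeLatinValidator(compile) {
  const forbidden = compile("[^\\p{Latin}\\p{Common}\\p{Inherited}]");
  // Valid means: no character outside the three allowed scripts.
  return (value) => !forbidden.test(value);
}

// Stand-in for illustration only: rewrites \p{Name} to the native
// \p{Script=Name} form, which needs ES2018 support in the engine.
const nativeCompile = (source) =>
  new RegExp(source.replace(/\\p\{(\w+)\}/g, "\\p{Script=$1}"), "u");

// Prefer the library when it is present, otherwise use the stand-in.
const compile = typeof XRegExp === "function" ? XRegExp : nativeCompile;
const isLatin = makeLatinValidator(compile);

const value = "abcdef"; // from text field
const message = isLatin(value) ? "" : "Please enter valid latin characters only";
console.log(message === "" ? "ok" : message); // ok
```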
    

    Please note very carefully that as of Unicode v6.2, any and all of the following code points and code-point ranges are deemed to have the Script=Latin character property:

    0041-005A 
    0061-007A 
    00AA 
    00BA 
    00C0-00D6 
    00D8-00F6 
    00F8-02B8 
    02E0-02E4 
    1D00-1D25 
    1D2C-1D5C 
    1D62-1D65 
    1D6B-1D77 
    1D79-1DBE 
    1E00-1EFF 
    2071 
    207F 
    2090-209C 
    212A-212B 
    2132 
    214E 
    2160-2188 
    2C60-2C7F 
    A722-A787 
    A78B-A78E 
    A790-A793 
    A7A0-A7AA 
    A7F8-A7FF 
    FB00-FB06 
    FF21-FF3A 
    FF41-FF5A 
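
    If hardcoding really cannot be avoided, the list can at least be kept as data and the character class generated from it instead of typed by hand. A minimal sketch, with the ranges copied from the Script=Latin list above and the helper name `toClassBody` invented for the example:

```javascript
// Script=Latin ranges copied from the Unicode 6.2 list above (all in the BMP).
const latinRanges = `
0041-005A 0061-007A 00AA 00BA 00C0-00D6 00D8-00F6 00F8-02B8 02E0-02E4
1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1DBE 1E00-1EFF 2071 207F
2090-209C 212A-212B 2132 214E 2160-2188 2C60-2C7F A722-A787 A78B-A78E
A790-A793 A7A0-A7AA A7F8-A7FF FB00-FB06 FF21-FF3A FF41-FF5A
`;

// Turn "0041-005A" into "\u0041-\u005A" and "00AA" into "\u00AA".
function toClassBody(rangeList) {
  return rangeList
    .trim()
    .split(/\s+/)
    .map((item) => item.split("-").map((hex) => "\\u" + hex).join("-"))
    .join("");
}

// Anchored so that every character of the input must be in the class.
const latinOnly = new RegExp("^[" + toClassBody(latinRanges) + "]*$");

console.log(latinOnly.test("Zürich")); // true
console.log(latinOnly.test("Москва")); // false
console.log(latinOnly.test("a b"));    // false: the space is Script=Common, not Latin
```

    As the last line shows, Script=Latin alone rejects spaces, digits and punctuation, so a practical validator would also need the Common and Inherited ranges appended in the same way.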
    

    Whereas these are the code points that have the Script=Common character property:

    0000-0040  
    005B-0060  
    007B-00A9  
    00AB-00B9  
    00BB-00BF  
    00D7
    00F7
    02B9-02DF  
    02E5-02E9  
    02EC-02FF  
    0374
    037E
    0385 
    0387
    0589
    060C
    061B
    061F
    0640
    0660-0669  
    06DD
    0964-0965  
    0E3F 
    0FD5-0FD8  
    10FB
    16EB-16ED
    1735-1736
    1802-1803
    1805
    1CD3
    1CE1
    1CE9-1CEC
    1CEE-1CF3
    1CF5-1CF6
    2000-200B
    200E-2064
    206A-2070  
    2074-207E  
    2080-208E  
    20A0-20BA  
    2100-2125
    2127-2129
    212C-2131  
    2133-214D  
    214F-215F  
    2189
    2190-23F3
    2400-2426
    2440-244A
    2460-26FF
    2701-27FF
    2900-2B4C
    2B50-2B59
    2E00-2E3B
    2FF0-2FFB  
    3000-3004
    3006
    3008-3020
    3030-3037  
    303C-303F
    309B-309C
    30A0
    30FB-30FC
    3190-319F
    31C0-31E3
    3220-325F
    327F-32CF
    3358-33FF
    4DC0-4DFF
    A700-A721
    A788-A78A
    A830-A839
    FD3E-FD3F  
    FDFD
    FE10-FE19  
    FE30-FE52
    FE54-FE66
    FE68-FE6B  
    FEFF
    FF01-FF20  
    FF3B-FF40
    FF5B-FF65
    FF70
    FF9E-FF9F
    FFE0-FFE6
    FFE8-FFEE
    FFF9-FFFD
    10100-10102
    10107-10133
    10137-1013F
    10190-1019B
    101D0-101FC
    1D000-1D0F5
    1D100-1D126
    1D129-1D166
    1D16A-1D17A
    1D183-1D184
    1D18C-1D1A9
    1D1AE-1D1DD
    1D300-1D356
    1D360-1D371
    1D400-1D454
    1D456-1D49C
    1D49E-1D49F
    1D4A2
    1D4A5-1D4A6
    1D4A9-1D4AC
    1D4AE-1D4B9
    1D4BB
    1D4BD-1D4C3
    1D4C5-1D505
    1D507-1D50A
    1D50D-1D514
    1D516-1D51C
    1D51E-1D539
    1D53B-1D53E
    1D540-1D544
    1D546
    1D54A-1D550
    1D552-1D6A5
    1D6A8-1D7CB
    1D7CE-1D7FF
    1F000-1F02B
    1F030-1F093
    1F0A0-1F0AE
    1F0B1-1F0BE
    1F0C1-1F0CF
    1F0D1-1F0DF
    1F100-1F10A
    1F110-1F12E
    1F130-1F16B
    1F170-1F19A
    1F1E6-1F1FF
    1F201-1F202
    1F210-1F23A
    1F240-1F248
    1F250-1F251
    1F300-1F320
    1F330-1F335
    1F337-1F37C
    1F380-1F393
    1F3A0-1F3C4
    1F3C6-1F3CA
    1F3E0-1F3F0
    1F400-1F43E
    1F440
    1F442-1F4F7
    1F4F9-1F4FC
    1F500-1F53D
    1F540-1F543
    1F550-1F567
    1F5FB-1F640
    1F645-1F64F
    1F680-1F6C5
    1F700-1F773
    E0001
    E0020-E007F
    

    And these are the code points that have the Script=Inherited character property:

    0300-036F
    0485-0486
    064B-0655
    0670
    0951-0952
    1CD0-1CD2
    1CD4-1CE0
    1CE2-1CE8
    1CED
    1CF4
    1DC0-1DE6
    1DFC-1DFF
    200C-200D
    20D0-20F0
    302A-302D
    3099-309A
    FE00-FE0F
    FE20-FE26
    101FD
    1D167-1D169
    1D17B-1D182
    1D185-1D18B
    1D1AA-1D1AD
    E0100-E01EF
    

    I hope the terrible maintenance, upkeep, legibility, and indeed writability problems that come of using literal code-point numbers like these make it clear that you want to at a bare minimum use the XRegExp add-ons.