
How to match a complete string containing unicode characters?


I want to validate a string, e.g. a name: a string without spaces. For plain ASCII the regex "^\w+$" would suffice, where ^ and $ anchor the match to the whole string. I tried to achieve the same result for Unicode characters, to support multiple languages, using the \pL character class. But for some reason $ does not match the end of the string. What am I doing wrong?

Code sample is here: https://play.golang.org/p/SPDEbWmqx0N

I copy pasted random characters from: http://www.columbia.edu/~fdc/utf8/

go version go1.12.5 darwin/amd64

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Unicode letter class, ASCII input
    fmt.Println(regexp.MatchString(`^\pL+$`, "testuser"))        // expected true
    fmt.Println(regexp.MatchString(`^\pL+$`, "user with space")) // expected false

    // Hindi script, anchored at both ends
    fmt.Println(regexp.MatchString(`^\pL+$`, "सकता")) // expected true, but $ does not match here

    // Hindi script, anchored at the start only
    fmt.Println(regexp.MatchString(`^\pL+`, "सकता")) // expected true

    // Chinese
    fmt.Println(regexp.MatchString(`^\pL+$`, "我能")) // expected true

    // French
    fmt.Println(regexp.MatchString(`^\pL+$`, "ægithaleshâtifs")) // expected true
}

actual result:
true  <nil>
false <nil>
false <nil>
true <nil>
true <nil>
true <nil>

expected result:
true <nil>
false <nil>
true <nil>
true <nil>
true <nil>
true <nil>

Solution

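  • The Hindi string ends with a vowel sign (ा, U+093E) that belongs to the Unicode category M (mark), not L (letter), so \pL+ stops before it and $ fails to match. To allow such combining marks as well, you may use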

    ^[\p{L}\p{M}]+$
    

    See Go demo.

    Details

    • ^ - start of string
    • [ - start of a character class that matches
      • \p{L} - any Unicode letter
      • \p{M} - any combining mark (diacritics, vowel signs)
    • ]+ - end of the character class, repeat 1+ times
    • $ - end of string.
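    The breakdown above can be checked with a small sketch. It assumes the standard unicode package's IsLetter and IsMark functions as stand-ins for the \p{L} and \p{M} classes; the names nameRe and category are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// nameRe is the pattern from the answer, compiled once (hypothetical name).
var nameRe = regexp.MustCompile(`^[\p{L}\p{M}]+$`)

// category reports whether r is a letter (L), a mark (M), or something else.
// Hypothetical helper, used only to show why \pL alone is not enough.
func category(r rune) string {
	switch {
	case unicode.IsLetter(r):
		return "L"
	case unicode.IsMark(r):
		return "M"
	default:
		return "other"
	}
}

func main() {
	// Print the code point and category of every rune in the Hindi sample.
	for _, r := range "सकता" {
		fmt.Printf("%U %s\n", r, category(r))
	}

	// With marks allowed, the whole string matches; spaces are still rejected.
	fmt.Println(nameRe.MatchString("सकता"))
	fmt.Println(nameRe.MatchString("user with space"))
}
```

    The last rune of the sample, U+093E, is reported as a mark, which is why ^\pL+$ rejected the string while ^\pL+ accepted it.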

    If you also want to match digits and _ as \w does, add them to the character class: ^[\p{L}\p{M}0-9_]+$ accepts ASCII digits only, while ^[\p{L}\p{M}\p{N}_]+$ accepts any Unicode number.
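    As a rough sketch of the \w-like variant, the pattern can be wrapped in a small validator. The names wordRe and isWord are hypothetical, and the sample inputs are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRe accepts letters, combining marks, Unicode numbers and underscore
// (hypothetical name; the pattern is the second one suggested above).
var wordRe = regexp.MustCompile(`^[\p{L}\p{M}\p{N}_]+$`)

// isWord reports whether s consists only of word-like characters
// and is not empty.
func isWord(s string) bool {
	return wordRe.MatchString(s)
}

func main() {
	// The second sample ends in Devanagari digits, which \p{N} covers.
	for _, s := range []string{"user_42", "सकता१२", "two words", ""} {
		fmt.Printf("%q %v\n", s, isWord(s))
	}
}
```

    Note that \p{N} is broader than 0-9: besides decimal digits of every script it also includes characters such as fractions and Roman numeral symbols, so the 0-9 form may be the better fit when only ASCII digits should be allowed.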