How to retrieve the first “complete” character of a []rune?


I am trying to write a function

func Anonymize(name string) string

that anonymizes names. Here are some examples of pairs of input and output so you get an idea of what it is supposed to do:

Müller → M.
von der Linden → v. d. L.
Meyer-Schulze → M.-S.

This function is supposed to work with names composed of arbitrary characters. While implementing it, I ran into the following question:

Given a []rune or string, how do I figure out how many runes I have to take to get a complete character, where "complete" means that all modifiers and combining accents belonging to the character are included? For instance, if the input is []rune{0x0041, 0x0308, 0x0066, 0x0067} (corresponding to the string Äfg, where Ä is represented as an A followed by a combining diaeresis), the function should return 2, because the first two runes together form the first character, Ä. If I took only the first rune, I would get A, which is incorrect.

I need an answer to this question because the name I want to anonymize might begin with an accented character and I don't want to remove the accent.


Solution

  • You can try the following function (inspired by "Go language string length"):

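    // graphemeRE matches one non-mark rune followed by any combining marks,
    // or, failing that, any single rune (needs import "regexp").
    var graphemeRE = regexp.MustCompile(`\PM\pM*|.`)

    // FirstGraphemeLen returns the number of runes in the first grapheme of
    // str, or 0 if str is empty.
    func FirstGraphemeLen(str string) int {
        return len([]rune(graphemeRE.FindString(str)))
    }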
    

    See this example:

    r := []rune{0x0041, 0x0308, 0x0066, 0x0041, 0x0308, 0x0067}
    s := string(r)
    fmt.Println(s, len(r), FirstGraphemeLen(s))
    

    Output:

    ÄfÄg 6 2
    

    The string consists of 6 runes, but its first grapheme takes up only 2 of them.


    The OP, FUZxxl, used another approach based on unicode.IsMark(r):

    IsMark reports whether the rune is a mark character (category M).

    The source (from FUZxxl's play.golang.org) includes:

    // take one character including all modifiers from the last name
    r, _, err := ln.ReadRune()
    if err != nil {
        /* ... */
    }
    
    aln = append(aln, r)
    
    for {
        r, _, err = ln.ReadRune()
        if err != nil {
            goto done
        }
    
        if !unicode.IsMark(r) {
            break
        }
    
        aln = append(aln, r)
    }
    
    aln = append(aln, '.')
    /* ... */
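
The unicode.IsMark idea above can also be written as a small standalone function. The following is only a sketch under stated assumptions, not the OP's actual code: the helper name firstCharLen is hypothetical, and the Anonymize body assumes that name parts are separated only by spaces and hyphens, which is all the three examples in the question show. It treats a rune of category M as belonging to the preceding character and does nothing more sophisticated than that.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// firstCharLen (hypothetical helper) reports how many runes make up the
// first character of rs: the leading rune plus every mark rune (category M)
// that directly follows it. It returns 0 for an empty slice.
func firstCharLen(rs []rune) int {
	if len(rs) == 0 {
		return 0
	}
	n := 1
	for n < len(rs) && unicode.IsMark(rs[n]) {
		n++
	}
	return n
}

// isSep reports whether r separates two parts of a name.
// Assumption: only spaces and hyphens act as separators.
func isSep(r rune) bool {
	return r == ' ' || r == '-'
}

// Anonymize shortens every part of name to its first character followed
// by a period. Separators are copied through unchanged.
func Anonymize(name string) string {
	var b strings.Builder
	rs := []rune(name)
	for i := 0; i < len(rs); {
		if isSep(rs[i]) {
			b.WriteRune(rs[i])
			i++
			continue
		}
		// Keep the first character of this part, combining marks included.
		n := firstCharLen(rs[i:])
		b.WriteString(string(rs[i : i+n]))
		b.WriteByte('.')
		// Skip the remainder of this part.
		i += n
		for i < len(rs) && !isSep(rs[i]) {
			i++
		}
	}
	return b.String()
}

func main() {
	fmt.Println(firstCharLen([]rune{0x0041, 0x0308, 0x0066, 0x0067})) // 2
	fmt.Println(Anonymize("Müller"))                                  // M.
	fmt.Println(Anonymize("von der Linden"))                          // v. d. L.
	fmt.Println(Anonymize("Meyer-Schulze"))                           // M.-S.
}
```

Working on a []rune rather than a reader avoids the goto and the error handling of the excerpt above, at the cost of converting the whole name up front, which should be acceptable for short inputs such as names.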