I am trying to write a function
func Anonymize(name string) string
that anonymizes names. Here are some examples of pairs of input and output so you get an idea of what it is supposed to do:
Müller → M.
von der Linden → v. d. L.
Meyer-Schulze → M.-S.
This function is supposed to work with names composed out of arbitrary characters. While implementing this function, I had the following question:
Given a []rune or string, how do I figure out how many runes I have to take to get a complete character, complete in the sense that all modifiers and combining accents belonging to the character are taken, too? For instance, if the input is []rune{0x0041, 0x0308, 0x0066, 0x0067} (corresponding to the string Äfg, where Ä is represented as the combination of an A and a combining diaeresis), the function should return 2 because the first two runes yield the first character, Ä. If I just took the first rune, I would get A, which is incorrect.
I need an answer to this question because the name I want to anonymize might begin with an accented character and I don't want to remove the accent.
You can try the following function (inspired by "Go language string length"):
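// graphemeRE matches a non-mark rune followed by its combining marks, or a stray leading mark.
var graphemeRE = regexp.MustCompile(`\PM\pM*|.`)

// FirstGraphemeLen returns the number of runes in the first grapheme of str (0 if str is empty).
func FirstGraphemeLen(str string) int {
	return len([]rune(graphemeRE.FindString(str)))
}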
See this example:
r := []rune{0x0041, 0x0308, 0x0066, 0x0041, 0x0308, 0x0067}
s := string(r)
fmt.Println(s, len(r), FirstGraphemeLen(s))
Output:
ÄfÄg 6 2
That string might use 6 runes, but its first grapheme uses 2.
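The same pattern can also split the whole string, which makes the gap between rune count and grapheme count visible. This is only a minimal sketch: the helper name graphemes is made up for illustration, and the pattern covers a base rune plus combining marks, which is an approximation and not full Unicode grapheme cluster segmentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// graphemeRE matches one non-mark rune followed by any number of mark runes
// (category M), or a single stray rune such as a leading combining mark.
var graphemeRE = regexp.MustCompile(`\PM\pM*|.`)

// graphemes splits s into base-plus-marks sequences.
// The name is hypothetical; it returns nil for an empty string.
func graphemes(s string) []string {
	return graphemeRE.FindAllString(s, -1)
}

func main() {
	s := "A\u0308fA\u0308g" // A + combining diaeresis, f, A + combining diaeresis, g
	gs := graphemes(s)
	fmt.Println("runes:", len([]rune(s)), "graphemes:", len(gs))
	if len(gs) > 0 {
		fmt.Println("runes in first grapheme:", len([]rune(gs[0])))
	}
}
```

Because FindAllString returns nil when nothing matches, the empty string needs no special case as long as the caller checks the length before indexing.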
The OP, FUZxxl, used another approach, based on unicode.IsMark(r): IsMark reports whether the rune is a mark character (category M).
The source (from FUZxxl's play.golang.org) includes:
// take one character including all modifiers from the last name
r, _, err := ln.ReadRune()
if err != nil {
	/* ... */
}
aln = append(aln, r)
for {
	r, _, err = ln.ReadRune()
	if err != nil {
		goto done
	}
	if !unicode.IsMark(r) {
		break
	}
	aln = append(aln, r)
}
aln = append(aln, '.')
/* ... */
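For readers who want something self-contained, here is a rough sketch of how the IsMark idea could be applied to a []rune and then to the name examples from the question. It is not FUZxxl's actual code: the names firstGraphemeLen and anonymize are hypothetical, and the sketch assumes that name parts are separated only by a space or a hyphen.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// firstGraphemeLen returns how many runes at the start of rs form one base
// rune plus the mark runes (category M) that directly follow it.
func firstGraphemeLen(rs []rune) int {
	if len(rs) == 0 {
		return 0
	}
	n := 1
	for n < len(rs) && unicode.IsMark(rs[n]) {
		n++
	}
	return n
}

// isSep reports whether r separates name parts.
// Assumption: only ' ' and '-' act as separators.
func isSep(r rune) bool {
	return r == ' ' || r == '-'
}

// anonymize keeps the first base rune and its marks of every name part,
// appends a dot, and copies the separators unchanged.
func anonymize(name string) string {
	var b strings.Builder
	rs := []rune(name)
	for i := 0; i < len(rs); {
		if isSep(rs[i]) {
			b.WriteRune(rs[i])
			i++
			continue
		}
		n := firstGraphemeLen(rs[i:])
		b.WriteString(string(rs[i : i+n]))
		b.WriteByte('.')
		i += n
		// Skip the rest of this name part.
		for i < len(rs) && !isSep(rs[i]) {
			i++
		}
	}
	return b.String()
}

func main() {
	for _, name := range []string{"Müller", "von der Linden", "Meyer-Schulze", "A\u0308hrens"} {
		fmt.Println(anonymize(name))
	}
}
```

Working on a []rune instead of a reader avoids the need to put back the first non-mark rune that ends the loop, at the cost of converting the whole name up front.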