
Remove accents/diacritics in a string in JavaScript


How do I remove accented characters from a string, especially in IE6? I had something like this:

accentsTidy = function(s){
    var r=s.toLowerCase();
    r = r.replace(new RegExp(/\s/g),"");
    r = r.replace(new RegExp(/[àáâãäå]/g),"a");
    r = r.replace(new RegExp(/æ/g),"ae");
    r = r.replace(new RegExp(/ç/g),"c");
    r = r.replace(new RegExp(/[èéêë]/g),"e");
    r = r.replace(new RegExp(/[ìíîï]/g),"i");
    r = r.replace(new RegExp(/ñ/g),"n");                
    r = r.replace(new RegExp(/[òóôõö]/g),"o");
    r = r.replace(new RegExp(/œ/g),"oe");
    r = r.replace(new RegExp(/[ùúûü]/g),"u");
    r = r.replace(new RegExp(/[ýÿ]/g),"y");
    r = r.replace(new RegExp(/\W/g),"");
    return r;
};

but IE6 gives me trouble; it seems it doesn't like my regular expressions.


Solution

  • With ES2015/ES6 String.prototype.normalize():

    const str = "Crème Brûlée"
    str.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
    > "Creme Brulee"
    

    Note: use NFKD if you want things like \uFB01 (ﬁ) normalized (to fi).

    Two things are happening here:

    1. Normalizing to the NFD Unicode normal form decomposes precomposed characters into a base character followed by combining marks. The è of Crème ends up expressed as e + U+0300 (combining grave accent).
    2. A regex character class matching the U+0300 → U+036F range then strips the diacritics globally; the Unicode standard conveniently groups them as the Combining Diacritical Marks block.
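
    As a minimal sketch of those two steps, the decomposition can be inspected directly. The helper name stripDiacritics is made up for illustration, and the strings use \u escapes so the example does not depend on file encoding:

```javascript
// Step 1: NFD splits the precomposed è (U+00E8) into e + U+0300.
const composed = "Cr\u00E8me";
const decomposed = composed.normalize("NFD");

// List the code points of the decomposed string as U+XXXX labels.
const codePoints = Array.from(decomposed, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
// codePoints: U+0043 U+0072 U+0065 U+0300 U+006D U+0065

// Step 2: drop everything in the Combining Diacritical Marks block.
// stripDiacritics is a hypothetical helper name, not a built-in.
function stripDiacritics(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const plain = stripDiacritics("Cr\u00E8me Br\u00FBl\u00E9e"); // "Creme Brulee"

// NFD leaves the ligature U+FB01 alone; NFKD expands it to "fi".
const ligatureNfd = "\uFB01".normalize("NFD");   // still "\uFB01"
const ligatureNfkd = "\uFB01".normalize("NFKD"); // "fi"

// Letters without a decomposition pass through unchanged.
const untouched = stripDiacritics("\u00E6\u0153"); // æ and œ stay as-is
```

    Letters such as æ and œ have no decomposition, so explicit replacements like the ones in the question (ae, oe) would still be needed for them.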

    As of 2021, one can also use Unicode property escapes:

    str.normalize("NFD").replace(/\p{Diacritic}/gu, "")
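
    One caveat worth checking before switching: the Diacritic property also covers some spacing characters, including the ASCII ^ and `, so this version can remove more than the U+0300 → U+036F class does. The sketch below compares it with \p{M} (the Mark general category), which is offered here as a possible alternative and is an assumption of this example, not something the approach above depends on:

```javascript
const input = "Cr\u00E8me ^ Br\u00FBl\u00E9e";

// \p{Diacritic} also matches the spacing circumflex "^" (U+005E),
// so it disappears along with the combining accents.
const viaDiacritic = input.normalize("NFD").replace(/\p{Diacritic}/gu, "");

// \p{M} matches combining marks only, so "^" survives.
const viaMark = input.normalize("NFD").replace(/\p{M}/gu, "");
```

    Here viaDiacritic loses the caret (leaving two adjacent spaces), while viaMark keeps it. Which behavior is preferable depends on whether stray ^ or ` characters can appear in your input.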
    

    See comment for performance testing.

    Alternatively, if you just want sorting:

    Intl.Collator has sufficient support (about 95% at the time of writing); a polyfill is also available, but I haven't tested it.

    // Locale-aware comparison via Intl.Collator
    const c = new Intl.Collator();
    ["creme brulee", "crème brûlée", "crame brulai", "crome brouillé",
    "creme brulay", "creme brulfé", "creme bruléa"].sort(c.compare)
    > [ 'crame brulai', 'creme brulay', 'creme bruléa', 'creme brulee', 'crème brûlée', 'creme brulfé', 'crome brouillé']

    // Default sort() compares UTF-16 code units, so è sorts after every ASCII letter
    ["crème brûlée", "crame brulai", "creme brulee", "crexe brulee", "crome brouillé"].sort()
    > [ 'crame brulai', 'creme brulee', 'crexe brulee', 'crome brouillé', 'crème brûlée']

    // localeCompare gives a locale-aware order without constructing a collator
    ["crème brûlée", "crame brulai", "creme brulee", "crexe brulee", "crome brouillé"].sort((a,b) => a.localeCompare(b))
    > [ 'crame brulai', 'creme brulee', 'crème brûlée', 'crexe brulee', 'crome brouillé']
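
    If the goal is matching rather than ordering, the same collator machinery can compare strings while ignoring accents and case, with no stripping at all. A minimal sketch, assuming the "en" locale and the sensitivity: "base" option suit your data (the helper name sameIgnoringAccents is hypothetical):

```javascript
// With sensitivity "base", only base letters count: e, é and E compare equal.
const baseCollator = new Intl.Collator("en", { sensitivity: "base" });

// Hypothetical helper: true when a and b differ only by accents or case.
function sameIgnoringAccents(a, b) {
  return baseCollator.compare(a, b) === 0;
}

const menu = [
  "crame brulai",
  "cr\u00E8me br\u00FBl\u00E9e", // crème brûlée
  "crome brouill\u00E9",         // crome brouillé
];

// Keep only the entries that match the unaccented, differently cased query.
const hits = menu.filter((item) => sameIgnoringAccents(item, "Creme Brulee"));
```

    Only the crème brûlée entry ends up in hits; the original strings are left untouched, which can matter when the accented form still has to be displayed.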