How to remove all of the diacritics from a file?


I have a file containing many vowels with diacritics. I need to make these replacements:

  • Replace ā, á, ǎ, and à with a.
  • Replace ē, é, ě, and è with e.
  • Replace ī, í, ǐ, and ì with i.
  • Replace ō, ó, ǒ, and ò with o.
  • Replace ū, ú, ǔ, and ù with u.
  • Replace ǖ, ǘ, ǚ, and ǜ with ü.
  • Replace Ā, Á, Ǎ, and À with A.
  • Replace Ē, É, Ě, and È with E.
  • Replace Ī, Í, Ǐ, and Ì with I.
  • Replace Ō, Ó, Ǒ, and Ò with O.
  • Replace Ū, Ú, Ǔ, and Ù with U.
  • Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.

I know I can replace them one at a time with this:

sed -i 's/ā/a/g' ./file.txt

Is there a more efficient way to replace all of these?


Solution

  • The man page for iconv describes a transliteration option:

    //TRANSLIT
    When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

    So we can run iconv with //TRANSLIT on the file (note that this also reduces ü and Ü to plain u and U):

    kent$  cat test1
        Replace ā, á, ǎ, and à with a.
        Replace ē, é, ě, and è with e.
        Replace ī, í, ǐ, and ì with i.
        Replace ō, ó, ǒ, and ò with o.
        Replace ū, ú, ǔ, and ù with u.
        Replace ǖ, ǘ, ǚ, and ǜ with ü.
        Replace Ā, Á, Ǎ, and À with A.
        Replace Ē, É, Ě, and È with E.
        Replace Ī, Í, Ǐ, and Ì with I.
        Replace Ō, Ó, Ǒ, and Ò with O.
        Replace Ū, Ú, Ǔ, and Ù with U.
        Replace Ǖ, Ǘ, Ǚ, and Ǜ with U.
    
    
    kent$  iconv -f utf8 -t ascii//TRANSLIT test1
        Replace a, a, a, and a with a.
        Replace e, e, e, and e with e.
        Replace i, i, i, and i with i.
        Replace o, o, o, and o with o.
        Replace u, u, u, and u with u.
        Replace u, u, u, and u with u.
        Replace A, A, A, and A with A.
        Replace E, E, E, and E with E.
        Replace I, I, I, and I with I.
        Replace O, O, O, and O with O.
        Replace U, U, U, and U with U.
        Replace U, U, U, and U with U.
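
As the output above shows, transliterating to ASCII also turns ü into u, whereas the question asks for ǖ, ǘ, ǚ, and ǜ to become ü. If that distinction matters, one option is to stay with sed and list the mappings explicitly. The following is only a sketch: the function name `strip_tone_marks` is made up for illustration, it assumes a sed that accepts `-E` (GNU and BSD sed do), and it assumes the file is UTF-8 with precomposed characters rather than combining marks.

```shell
#!/bin/sh
# strip_tone_marks (hypothetical name): reads text on stdin and writes it to
# stdout with the tone marks listed in the question removed.
# Alternation (-E) is used instead of bracket expressions so that each
# accented letter is matched as a whole byte sequence, whatever the locale.
strip_tone_marks() {
  sed -E \
    -e 's/ā|á|ǎ|à/a/g' \
    -e 's/ē|é|ě|è/e/g' \
    -e 's/ī|í|ǐ|ì/i/g' \
    -e 's/ō|ó|ǒ|ò/o/g' \
    -e 's/ū|ú|ǔ|ù/u/g' \
    -e 's/ǖ|ǘ|ǚ|ǜ/ü/g' \
    -e 's/Ā|Á|Ǎ|À/A/g' \
    -e 's/Ē|É|Ě|È/E/g' \
    -e 's/Ī|Í|Ǐ|Ì/I/g' \
    -e 's/Ō|Ó|Ǒ|Ò/O/g' \
    -e 's/Ū|Ú|Ǔ|Ù/U/g' \
    -e 's/Ǖ|Ǘ|Ǚ|Ǜ/Ü/g'
}
```

It could be used as `strip_tone_marks < file.txt > file.new`, followed by a check of `file.new` before replacing the original. Characters outside the twelve groups are left untouched, which is the main difference from the iconv approach.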