bash diacritics transliteration

Removing diacritical marks from a Greek text in an automatic way


I have a decompiled stardict dictionary in the form of a tab file

κακός <tab> bad

where <tab> stands for a tab character.

Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.

Thus I'd like to convert the whole file so that the keyword has its diacritics removed and the original spelling is kept in the definition. So the line would become

κακος <tab> <h3>κακός</h3> <br/> bad

I know I could read the file line by line in bash, as described here [1]

while IFS= read -r line
do
    command    # placeholder: process "$line" here
done < file

But is there any way to automate the conversion of each line? I have heard of iconv [2] but did not manage to achieve the desired conversion with it. I would prefer a bash script.


Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?

Perseus' way of doing it


/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.

[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?


Solution

  • You can remove diacritics from a string relatively easily using Perl:

    $_=NFKD($_);s/\p{InDiacriticals}//g;
    

    for example:

    $ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
    ωωωωωωω Ω
    

    This works as follows:

    • -CS enables UTF-8 on Perl's standard streams (stdin, stdout and stderr)
    • -MUnicode::Normalize loads the Unicode::Normalize module, which provides NFKD()
    • -e executes the script given on the command line; -n loops over the input lines; -p does the same and also prints each line after the script has run, so -p alone would be enough here
    • NFKD() converts the line to Unicode Normalization Form KD (compatibility decomposition): each accented letter is split into its base letter followed by separate combining marks, which makes the marks easy to remove in the next step
    • s/\p{InDiacriticals}//g deletes every character in Unicode's Combining Diacritical Marks block (U+0300 to U+036F)

    This should work for removing diacritics from any script or language with good Unicode support, not just Greek.