python python-3.x spell-checking diacritics

Keep records with Spanish accents in Database using Python

I need to clean up a database of Spanish text, with the requirement that accent marks must be kept.

For instance, if the DB contains "Administración" and "Administracion", I have to identify them as equal but keep the one with the accent mark. After some research, every solution I found, such as converting Unicode to ASCII or using PyEnchant, keeps the one without the accent mark.

Is there any library (for Python 3.5) or other way to determine the correct one and keep it?

Solution

Caveats

Depending on the content of the database, this may well be a nontrivial task. Some pairs are simply a correct word and its misspelling:

*administracion administración

However, there are also many Spanish words that differ only by an accent and are all valid words:

ejército ejercito ejercitó | tu tú

If you are only considering nouns, the number drops considerably, leaving mostly foreign loanwords with variable stress:

beisbol béisbol

and a few native words with multiple accepted spellings:

período periodo | reúma reuma
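
The caveats above suggest checking for ambiguity before deleting anything. Here is a minimal sketch, assuming the words have already been loaded into a Python list. It strips only the acute accent (so ñ and ü are left alone), groups words by their unaccented form, and separates groups with exactly one accented variant from groups with several. The function names `strip_acute` and `classify` are hypothetical, chosen for illustration.

```python
import unicodedata
from collections import defaultdict

ACUTE = "\u0301"  # combining acute accent


def strip_acute(word):
    """Remove acute accents only; ñ and ü are preserved."""
    decomposed = unicodedata.normalize("NFD", word)
    without_acute = decomposed.replace(ACUTE, "")
    return unicodedata.normalize("NFC", without_acute)


def classify(words):
    """Group words by their unaccented form (hypothetical helper).

    Returns (safe, ambiguous):
      safe      maps an unaccented word to its single accented variant
      ambiguous maps an unaccented form to several accented variants
    """
    groups = defaultdict(set)
    for word in words:
        groups[strip_acute(word)].add(word)

    safe, ambiguous = {}, {}
    for key, variants in groups.items():
        accented = sorted(variants - {key})
        if len(accented) == 1 and key in variants:
            safe[key] = accented[0]
        elif len(accented) > 1:
            ambiguous[key] = accented
    return safe, ambiguous
```

Note that this only flags groups with more than one accented form, such as ejército / ejercitó. Pairs like tu / tú, where the unaccented form is itself a valid word, would still land in `safe`, so telling them apart would need a dictionary or manual review.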

Query

If you are unlikely to encounter such cases, you could use an SQL query along the lines of:

SELECT a.word AS "Good word", b.word AS "Bad word"
FROM   spanish_db AS a
JOIN   spanish_db AS b

--Spanish words have at most one acute accent, so the REPLACE calls can be nested safely
ON     REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(a.word, 'á', 'a'),
                                                       'é', 'e'),
                                                       'í', 'i'),
                                                       'ó', 'o'),
                                                       'ú', 'u') = b.word

--So as not to match identical words
AND    a.word != b.word

This will return all pairs of words where both an accented and an unaccented form appear. You can adapt this to edit, delete or cleanse the entries as required.
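
Since the question asks about Python, here is a rough sketch of how the query could be driven from a script to delete the unaccented forms. It assumes SQLite through the standard-library `sqlite3` module, because the database engine is not stated. The table `spanish_db` and column `word` are taken from the query, while the function name `delete_unaccented_duplicates` and the sample rows are made up for illustration.

```python
import sqlite3

# Same join as the query above, with single-quoted string literals.
PAIR_QUERY = """
SELECT a.word, b.word
FROM   spanish_db AS a
JOIN   spanish_db AS b
ON     REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(a.word, 'á', 'a'),
                                                       'é', 'e'),
                                                       'í', 'i'),
                                                       'ó', 'o'),
                                                       'ú', 'u') = b.word
AND    a.word != b.word
"""


def delete_unaccented_duplicates(conn):
    """Delete each unaccented word that has an accented twin; return the pairs."""
    pairs = conn.execute(PAIR_QUERY).fetchall()
    conn.executemany(
        "DELETE FROM spanish_db WHERE word = ?",
        [(bad,) for _good, bad in pairs],
    )
    conn.commit()
    return pairs


# Demo on an in-memory database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spanish_db (word TEXT)")
conn.executemany(
    "INSERT INTO spanish_db (word) VALUES (?)",
    [("administración",), ("administracion",), ("arcén",), ("arcen",), ("casa",)],
)

pairs = delete_unaccented_duplicates(conn)
remaining = sorted(row[0] for row in conn.execute("SELECT word FROM spanish_db"))
```

Running the SELECT on its own first and reviewing the pairs before deleting is probably the safer workflow, given the caveats about valid unaccented words.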

Example

Good word       Bad word
"acedía"        "acedia"
"aeróbic"       "aerobic"
"aeróstato"     "aerostato"
"afrodisíaco"   "afrodisiaco"
"alcalá"        "alcala"
"alvéolo"       "alveolo"
"alérgeno"      "alergeno"
"amoníaco"      "amoniaco"
"anémona"       "anemona"
"arcén"         "arcen"