
Keep records with Spanish accents in a database using Python


I need to clean up a database that is in Spanish, with the requirement that I keep accent marks.

For instance, if the DB contains "Administración" and "Administracion", I have to treat them as equal but keep the one with the accent mark. From my research, every solution I found, such as converting Unicode to ASCII or using PyEnchant, keeps the one without the accent mark.

Is there a library (for Python 3.5) or another way to determine the correct one and keep it?


Solution

  • Caveats

    Depending on the content of the database, this may well be a nontrivial task. Some pairs will be plain misspellings (the asterisk marks the incorrect form):

    • *administracion administración

    But there are also many pairs of Spanish words that differ only by an accent and are both valid:

    • ejército ejercito ejercitó | tú tu

    If you are only considering nouns, the number drops considerably, leaving mostly foreign loanwords with variable stress:

    • beisbol béisbol

    and a few native words with multiple accepted spellings:

    • período periodo | reúma reuma
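
One way to spot some of these cases before deleting anything is to group the words by their accent-stripped form: a group that contains two or more distinct accented spellings (such as ejército and ejercitó) cannot be resolved automatically. The following is a minimal sketch, assuming the word list fits in memory; `strip_acute`, `group_by_bare_form` and `ambiguous_groups` are illustrative names, not part of any library. It only removes the acute accent, so ñ and ü are left alone.

```python
import unicodedata
from collections import defaultdict

ACUTE = "\u0301"  # combining acute accent


def strip_acute(word):
    """Remove acute accents only; ñ and ü are left untouched."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def group_by_bare_form(words):
    """Map each accent-stripped form to the set of spellings seen for it."""
    groups = defaultdict(set)
    for word in words:
        word = unicodedata.normalize("NFC", word)
        groups[strip_acute(word)].add(word)
    return groups


def ambiguous_groups(words):
    """Return groups with two or more accented spellings (need a human decision)."""
    result = {}
    for bare, spellings in group_by_bare_form(words).items():
        accented = {w for w in spellings if w != bare}
        if len(accented) >= 2:
            result[bare] = sorted(spellings)
    return result
```

This check does not catch pairs such as tú/tu or período/periodo, where the unaccented form is itself a valid word; those would still need a dictionary lookup or manual review.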

    Query

    If you are unlikely to encounter such cases, you could use an SQL query along the lines of:

    SELECT a.word AS "Good word", b.word AS "Bad word"
    FROM   spanish_db AS a
    JOIN   spanish_db AS b
    
    -- Spanish words have at most one accent, so the REPLACE calls can safely be nested
    ON     REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(a.word, 'á', 'a'),
                                                           'é', 'e'),
                                                           'í', 'i'),
                                                           'ó', 'o'),
                                                           'ú', 'u') = b.word
    
    --So as not to match identical words
    AND    a.word != b.word
    

    This returns every pair of words for which both an accented and an unaccented form appear. You can adapt it to edit, delete or otherwise cleanse the entries as required.
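
As one possible adaptation, the sketch below runs the same join from Python 3 with the standard-library `sqlite3` module and deletes the unaccented duplicates. It assumes an SQLite database and reuses the `spanish_db` table and `word` column from the query; the function name `delete_unaccented_duplicates` is made up for illustration. Other database engines would need their own driver and possibly different quoting.

```python
import sqlite3

# Same join as the SQL query, with single-quoted string literals.
PAIRS_SQL = """
SELECT a.word, b.word
FROM   spanish_db AS a
JOIN   spanish_db AS b
ON     REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(a.word, 'á', 'a'),
                                                       'é', 'e'),
                                                       'í', 'i'),
                                                       'ó', 'o'),
                                                       'ú', 'u') = b.word
AND    a.word != b.word
"""


def delete_unaccented_duplicates(conn):
    """Delete each unaccented word that has an accented twin; return the pairs."""
    pairs = conn.execute(PAIRS_SQL).fetchall()
    conn.executemany(
        "DELETE FROM spanish_db WHERE word = ?",
        [(bad,) for _good, bad in pairs],
    )
    conn.commit()
    return pairs


if __name__ == "__main__":
    # Throwaway in-memory database, for demonstration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spanish_db (word TEXT)")
    conn.executemany(
        "INSERT INTO spanish_db VALUES (?)",
        [("administración",), ("administracion",), ("arcén",)],
    )
    print(delete_unaccented_duplicates(conn))
    print(conn.execute("SELECT word FROM spanish_db ORDER BY word").fetchall())
```

Because the delete is unconditional, it would also remove valid unaccented words such as ejercito if the accented form is present, so it is only appropriate when the caveats above do not apply to your data.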


    Example

    Good word       Bad word
    "acedía"        "acedia"
    "aeróbic"       "aerobic"
    "aeróstato"     "aerostato"
    "afrodisíaco"   "afrodisiaco"
    "alcalá"        "alcala"
    "alvéolo"       "alveolo"
    "alérgeno"      "alergeno"
    "amoníaco"      "amoniaco"
    "anémona"       "anemona"
    "arcén"         "arcen"