I need to purge duplicates from a Spanish-language database, but the requirement is that I must keep accent marks.
For instance, if the DB contains "Administración" and "Administracion", I have to identify them as equal but keep the one with the accent mark. After some research, every solution I found, such as converting Unicode to ASCII or using PyEnchant, keeps the one without the accent mark.
Is there any library (for Python 3.5) or other way to determine the correct one and keep it?
Depending on the content of the database, this may well be a nontrivial task. There may be simple misspellings:
administracion
administración
There are also many sets of Spanish words that differ only by an accent and are all valid words:
ejército, ejercito, ejercitó
tu, tú
If you are only considering nouns, the number of such pairs drops considerably, leaving mostly foreign loanwords with variable stress:
beisbol
béisbol
and a few native words with multiple accepted spellings:
período, periodo
reúma, reuma
If you are unlikely to encounter such cases, you could use an SQL query along the lines of:
SELECT a.word AS "Good word", b.word AS "Bad word"
FROM spanish_db AS a
JOIN spanish_db AS b
-- Spanish words have at most one acute accent, so the REPLACE calls can be nested safely
ON REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(a.word, 'á', 'a'),
                                           'é', 'e'),
                                   'í', 'i'),
                           'ó', 'o'),
                   'ú', 'u') = b.word
--So as not to match identical words
AND a.word != b.word
This returns every pair of words for which both an accented and an unaccented form appear. You can adapt it to edit, delete, or cleanse the entries as required. Sample output:
Good word        Bad word
"acedía"         "acedia"
"aeróbic"        "aerobic"
"aeróstato"      "aerostato"
"afrodisíaco"    "afrodisiaco"
"alcalá"         "alcala"
"alvéolo"        "alveolo"
"alérgeno"       "alergeno"
"amoníaco"       "amoniaco"
"anémona"        "anemona"
"arcén"          "arcen"