I have a table of book titles - most of them dublicate several times for different editions. Many titles were mistakenly imported with missing non-ASCII characters i.e. "La métamorphose" turned into "La m?tamorphose" sometimes the é turned into a space or was simply removed from string.
The table
editionid | bookid | title
--------------------------------------------
1 | 1 | Elementarne čestice
2 | 1 | Elementarne ?estice
3 | 1 | Elementarne estice
4 | 1 | Las partículas elementales
5 | 2 | Schöne neue Welt
6 | 2 | Sch ne neue Welt
I want to identify the incorrect titles by stripping the non-ASCIIs of titles and comparing to other titles of the same book. If there is a match I found a flawed title.
result:
o.title (flawed) | e.title (good)
-----------------------------------
Elementarne ?estice | Elementarne čestice
Elementarne estice | Elementarne čestice
Sch ne neue Welt | Schöne neue Welt
The table is rather large but since I only need to do this once performance is not key.
My approach:
select distinct on (o.editionid) o.title, e.title
from editions o
inner join editions e on (o.bookid = e.bookid)
where o.bookid between 1 and 1000
and e.title !~ '^[ -~]*$' -- only for performance
and ((
e.title like '%Þ%' and (o.title = regexp_replace(e.title, '[Þ]', '?') or o.title = regexp_replace(e.title, '[Þ]', ' ') or o.title = regexp_replace(e.title, '[Þ]', ''))
) or (
e.title like '%ß%' and (o.title = regexp_replace(e.title, '[ß]', '?') or o.title = regexp_replace(e.title, '[ß]', ' ') or o.title = regexp_replace(e.title, '[ß]', ''))
) or (
e.title like '%à%' and (o.title = regexp_replace(e.title, '[à]', '?') or o.title = regexp_replace(e.title, '[à]', ' ') or o.title = regexp_replace(e.title, '[à]', ''))
.
.
.
))
That works so far but it appears impossible to add all non-ASCII characters separately. Does anyone have an idea of a more general approach which covers all non-ASCII chars at once?
Second - it doesn't work if two different characters where stripped and I don't know how to solve that.
And third but maybe impossible - often only some non-ASCIIs were turned but not all i.e. "Weiße Nächte" turned into "Wei e Nächte" - it would be great if those could be covered too.
After some fiddling it wasn't that hard in the end:
select distinct on (o.editionid) o.title as flawed, e.title as good
from editions o
inner join editions e on (o.bookid = e.bookid)
where o.bookid between 0 and 10000
and e.title ~ '[^\x00-\x7F]'
and (
o.title = regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g')
or o.title = regexp_replace(e.title, '[^\x00-\x7F]', ' ', 'g')
)
regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g')
is the key where \x00-\x7F
are all Unicode characters not in ASCII scheme and 'g'
keeps on searching for more hits in the same string.