Search code examples
stringpostgresqlnon-ascii-characters

How to find string copies where one was stripped of non-ASCII characters


I have a table of book titles - most of them dublicate several times for different editions. Many titles were mistakenly imported with missing non-ASCII characters i.e. "La métamorphose" turned into "La m?tamorphose" sometimes the é turned into a space or was simply removed from string.

The table

editionid | bookid | title
--------------------------------------------
1         | 1      | Elementarne čestice
2         | 1      | Elementarne ?estice
3         | 1      | Elementarne estice
4         | 1      | Las partículas elementales
5         | 2      | Schöne neue Welt
6         | 2      | Sch ne neue Welt

I want to identify the incorrect titles by stripping the non-ASCIIs of titles and comparing to other titles of the same book. If there is a match I found a flawed title.

result:

o.title (flawed)    | e.title (good)
-----------------------------------
Elementarne ?estice | Elementarne čestice
Elementarne estice  | Elementarne čestice
Sch ne neue Welt    | Schöne neue Welt

The table is rather large but since I only need to do this once performance is not key.

My approach:

select distinct on (o.editionid) o.title, e.title
from editions o
inner join editions e on (o.bookid = e.bookid)
where o.bookid between 1 and 1000
    and e.title !~ '^[ -~]*$' -- only for performance
    and ((
      e.title like '%Þ%' and (o.title = regexp_replace(e.title, '[Þ]', '?') or o.title = regexp_replace(e.title, '[Þ]', ' ') or o.title = regexp_replace(e.title, '[Þ]', ''))
    ) or (
      e.title like '%ß%' and (o.title = regexp_replace(e.title, '[ß]', '?') or o.title = regexp_replace(e.title, '[ß]', ' ') or o.title = regexp_replace(e.title, '[ß]', ''))
    ) or (
      e.title like '%à%' and (o.title = regexp_replace(e.title, '[à]', '?') or o.title = regexp_replace(e.title, '[à]', ' ') or o.title = regexp_replace(e.title, '[à]', ''))
    .
    .
    .
    ))

That works so far but it appears impossible to add all non-ASCII characters separately. Does anyone have an idea of a more general approach which covers all non-ASCII chars at once?

Second - it doesn't work if two different characters where stripped and I don't know how to solve that.

And third but maybe impossible - often only some non-ASCIIs were turned but not all i.e. "Weiße Nächte" turned into "Wei e Nächte" - it would be great if those could be covered too.


Solution

  • After some fiddling it wasn't that hard in the end:

    select distinct on (o.editionid) o.title as flawed, e.title as good
    from editions o
    inner join editions e on (o.bookid = e.bookid)
    where o.bookid between 0 and 10000
        and e.title ~ '[^\x00-\x7F]'
        and (
                o.title = regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g') 
                or o.title = regexp_replace(e.title, '[^\x00-\x7F]', ' ', 'g')
            )
    

    regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g') is the key where \x00-\x7F are all Unicode characters not in ASCII scheme and 'g' keeps on searching for more hits in the same string.