Search code examples
regexpostgresqlregex-lookarounds

Handle initials in Postgresql


I'd like to keep together initials (max two letters) when there are punctuation or spaces in between.

I have the following snippet to tackle almost everything, but I am having issues in keeping together initials that are separated by punctuation and space. For instance, this is working on regular regex, but not in postgresql:

SELECT regexp_replace('R Z ELEMENTARY SCHOOL', 
                      '(\b[A-Za-z]{1,2}\b)\s+\W*(?=[a-zA-Z]{1,2}\b)', 
                      '\1')

The outcome should be "RZ ELEMENTARY SCHOOL". Other examples will include:

A & D ALTERNATIVE EDUCATION
J. & H. KNOWLEDGE DEVELOPMENT
A. - Z. EVOLUTION IN EDUCATION

The transformation should be as follows:

AD ALTERNATIVE EDUCATION
JH KNOWLEDGE DEVELOPMENT
AZ EVOLUTION IN EDUCATION

How to achieve this in Postgresql?

Thanks


Solution

  • Building on your current regex, I can recommend

    SELECT REGEXP_REPLACE(
        REGEXP_REPLACE('J. & H. KNOWLEDGE DEVELOPMENT', '\m([[:alpha:]]{1,2})\M\s*\W*(?=[[:alpha:]]{1,2}\M)', '\1'),
        '^([[:alpha:]]+)\W+',
        '\1 '
    )
    

    See the online demo, yielding

        regexp_replace
    1   JH KNOWLEDGE DEVELOPMENT
    

    It is a two step solution. The first regex matches

    • \m([[:alpha:]]{1,2})\M - a whole one or two letter words captured into Group 1 (\m is a leading word boundar, and \M is a trailing word boundary)
    • \s* - zero or more whitespaces
    • \W* - zero or more non-word chars
    • (?=[[:alpha:]]{1,2}\M) - a positive lookahead that requires a whole one or two letter word immediately to the right of the current position.

    The match is replaced with the contents of Group 1 (\1).

    The second regex matches a letter word at the start of the string and replaces all non-word chars after it with a space.