Search code examples
postgresqltextstring-metric

Postgresql: Processing Text, Detect out of Alphabetical order rows


I have some processed text that's in (mostly) alphabetical order, e.g. these are the first word of each paragraph:

  • Adelanto
  • Agoura Hills
  • Alameda
  • Albany
  • Old Albany
    • New Albany
  • Alhambra
  • Aliso Viejo
  • Alturas

So each of the words above represents the start of a paragraph e.g.:

Adelanto, a city in San Bernardino County, California about 9 miles (14 km) northwest of Victorville in the High Desert portion of the Inland Empire of the Greater Los Angeles Area...

The text can have many paragraphs per entry so that paragraphs not in alphabetical order are treated as new entries.

So each entry would correspond to a place.

In the Example, O(ld) is after A(lbany) so Old Albany is a Entry, but N(ew) is before O(ld), so New Albany a continuation of Old Albany.

My question is: Is there something already existing other than just using the ASCII character difference between the first letter of Albany and Old Albany/New Albany in Postgresql? E.g. ASCII ('A') - ASCII ('O') gives -14.

So do I just use ASCII values on the first characters? Or is there a more general solution?


Solution

  • Currently I'm using the ASCII difference between the first letters of the text, comparing to the previousRow.description and also nextRow.description e.g.

    ABS (ASCII (substring ( currentRow.description, 1,1 ) )  - 
    ASCII ( substring ( previousRow.description, 1 ,1 ) )