Search code examples
postgresqlfull-text-searchtsvector

How to find real lexeme start position in source document from tsvector?


A sample document and tsvector out of it:

select
    *
from
    unnest(to_tsvector('english', 'something wide this more wider and wider social-economy wide somethings'))
lexeme positions
economi 10
social 9
social-economi 8
someth 1,12
wide 2,11
wider 5,7

How do I interpret these positions on a client? The docs says that:

A position normally indicates the source word's location in the document.

(emphasis is mine)

This 'normally' part is what confuses me.

A location in my understanding is a starting symbol index in a document. But here it seems to be just a lexeme's order index, not symbol index or even word order index.

I need to highlight the lexemes in a source document just like they are processed by Postgres (without ts_headline). But using the sample above word wide should be found at locations 2 and 11. However, since Postgres uses stemming, stop-word dictionaries, etc. the real word locations are different.

So how do I 'map' lexeme's positions to source document to highlight them? On a client. I need Postgres to just return the data it uses internally and interpret it somehow.

Basically, in the end I need to see something like this:

something wide this more wider and wider social-economy wide somethings

My first approach was to split the source document by whitespaces in tokens on a client, retrieve the tsvector, extract lexemes and compare each token with a lexeme using something like a string.StartsWith (currently on C#). But the problem is that some lexemes are a bit different from the source document's words. Note the 'economi' lexeme vs 'economy' token in source document. Also in a real project there are some extra synonyms used and so using string.StartsWith won't work.

That's why I needed real symbol positions. Is it possible to get them somehow?

UPDATE_01

Here is a sample of how I tried to split the source on a client to map locations (using C#):

var source = "something wide this more wider and wider social-economy wide somethings";
source
    .Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
    .Select((w, i) => new
    {
        Word = w,
        Pos = i + 1,
    })
    .OrderBy(w => w.Word)
;
lexeme positions
and 6
economy 9
more 4
social 8
something 1
somethings 11
this 3
wide 2
wide 10
wider 5
wider 7

Some of them do match, others mostly not.


Solution

  • I see Postgres considering social-economi(y) as one word first and providing the position as 8 and then social as 9 and economi 10 (repeating it as a different word and doing plus 1) and the remaining words have +1 added to it followed by economi(y). As @KasbolatKumakhov said we can use \W regex pattern to replace all non word characters with spaces and use that.

    SELECT *
    FROM unnest(to_tsvector('english',
                 regexp_replace('something;wide {}{}{ this.more wider and\wider social-economy wide/somethings
                   wide;somethings wide&somethings wide|somethings wide+somethings wide"somethings wide.somethings 
                   wide=somethings wide#somethings wide@somethings wide*somethings
                   wide~somethings wide$somethings wide:somethings wideËsomethings wide wideËsomethings',
                                E'\\W', ' ', 'g')));