A sample document and tsvector
out of it:
select
*
from
unnest(to_tsvector('english', 'something wide this more wider and wider social-economy wide somethings'))
lexeme | positions |
---|---|
economi | 10 |
social | 9 |
social-economi | 8 |
someth | 1,12 |
wide | 2,11 |
wider | 5,7 |
How do I interpret these positions on a client? The docs says that:
A position normally indicates the source word's location in the document.
(emphasis is mine)
This 'normally' part is what confuses me.
A location in my understanding is a starting symbol index in a document. But here it seems to be just a lexeme's order index, not symbol index or even word order index.
I need to highlight the lexemes in a source document just like they are processed by Postgres (without ts_headline
). But using the sample above word wide
should be found at locations 2
and 11
. However, since Postgres uses stemming, stop-word dictionaries, etc. the real word locations are different.
So how do I 'map' lexeme's positions to source document to highlight them? On a client. I need Postgres to just return the data it uses internally and interpret it somehow.
Basically, in the end I need to see something like this:
something wide this more wider and wider social-economy wide somethings
My first approach was to split the source document by whitespaces in tokens on a client, retrieve the tsvector
, extract lexemes and compare each token with a lexeme using something like a string.StartsWith
(currently on C#
). But the problem is that some lexemes are a bit different from the source document's words. Note the 'economi' lexeme vs 'economy' token in source document. Also in a real project there are some extra synonyms used and so using string.StartsWith
won't work.
That's why I needed real symbol positions. Is it possible to get them somehow?
Here is a sample of how I tried to split the source on a client to map locations (using C#
):
var source = "something wide this more wider and wider social-economy wide somethings";
source
.Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
.Select((w, i) => new
{
Word = w,
Pos = i + 1,
})
.OrderBy(w => w.Word)
;
lexeme | positions |
---|---|
and | 6 |
economy | 9 |
more | 4 |
social | 8 |
something | 1 |
somethings | 11 |
this | 3 |
wide | 2 |
wide | 10 |
wider | 5 |
wider | 7 |
Some of them do match, others mostly not.
I see Postgres considering social-economi(y) as one word first and providing the position as 8 and then social as 9 and economi 10 (repeating it as a different word and doing plus 1) and the remaining words have +1 added to it followed by economi(y). As @KasbolatKumakhov said we can use \W regex pattern to replace all non word characters with spaces and use that.
SELECT *
FROM unnest(to_tsvector('english',
regexp_replace('something;wide {}{}{ this.more wider and\wider social-economy wide/somethings
wide;somethings wide&somethings wide|somethings wide+somethings wide"somethings wide.somethings
wide=somethings wide#somethings wide@somethings wide*somethings
wide~somethings wide$somethings wide:somethings wideËsomethings wide wideËsomethings',
E'\\W', ' ', 'g')));