I'm working on a web app that lets the user search within a source repository. The program parses the diffs, but I can't find a way to get all parts of a diff into Postgres's full-text search vector.
Example:
select alias, description, token from ts_debug('Link to <a href="//www.yahoo.com">Yahoo!</a> web site');
+-----------+-----------------+----------------------------+
| alias | description | token |
+-----------+-----------------+----------------------------+
| asciiword | Word, all ASCII | Link |
| blank | Space symbols | |
| asciiword | Word, all ASCII | to |
| blank | Space symbols | |
| tag | XML tag | <a href="//www.yahoo.com"> |
| asciiword | Word, all ASCII | Yahoo |
| blank | Space symbols | ! |
| tag | XML tag | </a> |
| blank | Space symbols | |
| asciiword | Word, all ASCII | web |
| blank | Space symbols | |
| asciiword | Word, all ASCII | site |
+-----------+-----------------+----------------------------+
The text seems to be parsed correctly, but when I turn it into a document vector, the XML tags are not included.
select to_tsvector('simple', 'Link to <a href="//www.yahoo.com">Yahoo!</a> web site') to_tsvector;
+--------------------------------------------+
| to_tsvector |
+--------------------------------------------+
| 'link':1 'site':5 'to':2 'web':4 'yahoo':3 |
+--------------------------------------------+
I guess it has something to do with the configuration?
Any ideas?
The parser does extract the tags, but the built-in 'simple' configuration ignores them. You can see this in psql by running \dF+ simple: token types not listed there are ignored.
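If you'd rather check this from a query than from \dF+, here is a small sketch. It relies on ts_debug accepting a configuration name as its first argument and returning a "dictionaries" column. For an unmapped token type I'd expect that column to be empty, but I haven't pasted real output here.

```sql
-- Pass the configuration explicitly so the check does not depend on
-- default_text_search_config.
-- "dictionaries" lists the dictionaries mapped to each token type.
select alias, token, dictionaries
from ts_debug('simple', 'Link to <a href="//www.yahoo.com">Yahoo!</a> web site')
where alias = 'tag';
```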
You can tell it not to ignore them:
alter text search configuration simple add mapping for tag with simple;
But you would probably be better off copying the configuration and then modifying the copy.
You might also need a custom dictionary to process the tags, since the 'simple' dictionary is unlikely to do what you want: it stores each tag whole as a single lexeme. With the mapping added, the same query returns:
select to_tsvector('simple', 'Link to <a href="//www.yahoo.com">Yahoo!</a> web site') to_tsvector;
to_tsvector
------------------------------------------------------------------------------------
'</a>':5 '<a href="//www.yahoo.com">':3 'link':1 'site':7 'to':2 'web':6 'yahoo':4
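The suggestion above to copy the configuration rather than alter 'simple' itself could look roughly like this. The name diff_search is made up for illustration, and the sketch assumes the built-in 'simple' configuration has not already been given a tag mapping (the copy would inherit it, and the add mapping step would then fail).

```sql
-- "diff_search" is a hypothetical name; use whatever fits your app.
create text search configuration diff_search (copy = simple);

-- Map the tag token type to the 'simple' dictionary, in the copy only.
alter text search configuration diff_search add mapping for tag with simple;

-- Name the copy explicitly when building the vector.
select to_tsvector('diff_search', 'Link to <a href="//www.yahoo.com">Yahoo!</a> web site');
```

If you already ran the alter statement against 'simple' and want it back the way it was, this should undo it: alter text search configuration simple drop mapping for tag;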