
How does Sphinx handle URLs?


When working with PostgreSQL you can break apart a URL into several different lexemes when using full text search. For example:

SELECT to_tsvector('http://www.example.com/dir/page.html');
                               to_tsvector                                
--------------------------------------------------------------------------
 '/dir/page.html':3 'www.example.com':2 'www.example.com/dir/page.html':1
(1 row)

You can see that PostgreSQL has broken http://www.example.com/dir/page.html into three lexemes: the url minus the protocol (www.example.com/dir/page.html), the host (www.example.com) and the url_path (/dir/page.html). This is handy because it lets you search for www.example.com directly.

With that background, how does SphinxSearch handle indexing a URL? Does it behave similarly to PostgreSQL in that it breaks apart a URL into parts so that it can be easily searched?


Solution

  • Sphinx literally just breaks up the source text on any characters not listed in charset_table.

    So normally . and / just count as separators, which means a URL is only searchable by its groups of letters. This is most useful when combined with the phrase operator.
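    To make the splitting concrete, here is a rough Python sketch of the behaviour described above. It is not Sphinx code: the regex only approximates the ASCII part of a default-style charset_table (digits, letters folded to lowercase, underscore), and the names sphinx_like_tokens and phrase_match are hypothetical helpers made up for illustration.

```python
import re

# Assumption: approximates the ASCII part of a default-style charset_table
# (0..9, a..z, A..Z folded to lowercase, _). Every other character,
# including "." "/" and ":", acts as a separator.
TOKEN_RE = re.compile(r"[0-9a-z_]+")


def sphinx_like_tokens(text):
    """Split text into keywords the way a charset_table-based tokenizer would."""
    return TOKEN_RE.findall(text.lower())


def phrase_match(doc_tokens, query):
    """Return True if the query's keywords occur consecutively in doc_tokens."""
    wanted = sphinx_like_tokens(query)
    n = len(wanted)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == wanted
               for i in range(len(doc_tokens) - n + 1))


tokens = sphinx_like_tokens("http://www.example.com/dir/page.html")
print(tokens)
# ['http', 'www', 'example', 'com', 'dir', 'page', 'html']

# A phrase search for the host matches because the three keywords are adjacent.
print(phrase_match(tokens, "www.example.com"))  # True
```

    So unlike PostgreSQL, there is no single www.example.com keyword in the index. A phrase query such as "www example com" (in double quotes) finds the host because those three keywords sit next to each other in the indexed text.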