I'm using sqlite's full-text-search support to store documents that may contain &
, <
and >
characters. I intended to use the snippet function to highlight the matches for an html results page, but I don't see an obvious way to escape the text before injecting the markers. I'd rather not escape the text before storing it, because then amp
, lt
and gt
will become tokens.
The simplest workaround I can think of is to escape the pages before storing them, and use a custom tokenizer that would first unescape &
, <
and >
.
However... since the default for snippet(foo)
is to use <b>
and </b>
, it really seems like this is a very common use case and I'm convinced that there must be a way to handle this without reinventing the wheel in C. Am I overlooking a more elegant solution?
It appears that FTS indeed does not handle markup in text (note how the <p>
gets chopped up):
> CREATE VIRTUAL TABLE test USING fts3(content TEXT);
> INSERT INTO test VALUES('<p>Isn''t this <font face="Comic Sans">funny</font>?');
> INSERT INTO test VALUES('blah');
> SELECT snippet(test) FROM test WHERE content MATCH 'funny';
p>Isn't this <font face="Comic Sans"><b>funny</b></font>?
> SELECT snippet(test) FROM test WHERE content MATCH 'font';
p>Isn't this <<b>font</b> face="Comic Sans">funny</<b>font</b>>?
The cleanest way to store the text is as plain text, unescaped. However, to get proper highlighting of search results, you have two choices:
snippet
to specify markers that are guaranteed to never occur in the text (which might not be possible), and convert those to <b>...
when you are escaping the text for HTML; oroffsets
function instead and insert the markers by hand.