What process is to_tsvector doing to strip tags, and is it aware of <script> tags?
SELECT to_tsvector('simple', '<textarea>banana</textarea>'); -- 'banana':1
SELECT to_tsvector('simple', '<script>banana</script>'); -- empty
Where is it documented?
It's just the <script> and <style> tags: the default parser looks for these two specifically and reclassifies the tokens between the opening and closing tag as whitespace ("Space symbols"), so they never reach a dictionary. You can see that using ts_debug():
demo at db<>fiddle
select v
, to_tsvector('simple', v)
, ts_debug.*
from(values('<script>banana</script>')
,('<style>banana</style>')
,('<asdf>banana</asdf>'))_(v)
cross join lateral ts_debug(v)
order by 1;
v | to_tsvector | alias | description | token | dictionaries | dictionary | lexemes |
---|---|---|---|---|---|---|---|
<asdf>banana</asdf> | 'banana':1 | tag | XML tag | </asdf> | {} | null | null |
<asdf>banana</asdf> | 'banana':1 | asciiword | Word, all ASCII | banana | {english_stem} | english_stem | {banana} |
<asdf>banana</asdf> | 'banana':1 | tag | XML tag | <asdf> | {} | null | null |
<script>banana</script> | | tag | XML tag | <script> | {} | null | null |
<script>banana</script> | | blank | Space symbols | banana | {} | null | null |
<script>banana</script> | | tag | XML tag | </script> | {} | null | null |
<style>banana</style> | | tag | XML tag | </style> | {} | null | null |
<style>banana</style> | | blank | Space symbols | banana | {} | null | null |
<style>banana</style> | | tag | XML tag | <style> | {} | null | null |
It's just these two tags, as you can see in postgres/src/backend/tsearch/wparser_def.c:563:
static void
SpecialTags(TParser *prs)
{
    switch (prs->state->lenchartoken)
    {
        case 8:                 /* </script */
            if (pg_strncasecmp(prs->token, "</script", 8) == 0)
                prs->ignore = false;
            break;
        case 7:                 /* <script || </style */
            if (pg_strncasecmp(prs->token, "</style", 7) == 0)
                prs->ignore = false;
            else if (pg_strncasecmp(prs->token, "<script", 7) == 0)
                prs->ignore = true;
            break;
        case 6:                 /* <style */
            if (pg_strncasecmp(prs->token, "<style", 6) == 0)
                prs->ignore = true;
            break;
        default:
            break;
    }
}
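To make the effect of that flag easier to see outside the server, here is a minimal standalone sketch of the same decision table. It is not the PostgreSQL parser: it assumes the input has already been split into tag and word tokens, it swaps pg_strncasecmp for POSIX strncasecmp, and the names tag_prefix_len, update_ignore and count_indexed_words are hypothetical helpers made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>    /* strncasecmp (POSIX), standing in for pg_strncasecmp */

/* Length of the leading "<name" or "</name" part of a tag token,
 * i.e. everything before the first '>' or space. */
static size_t
tag_prefix_len(const char *tok)
{
    size_t n = 0;

    while (tok[n] != '\0' && tok[n] != '>' && tok[n] != ' ')
        n++;
    return n;
}

/* Same decision table as SpecialTags(), but on a plain bool
 * instead of the TParser struct (an assumption of this sketch). */
static bool
update_ignore(bool ignore, const char *tok)
{
    switch (tag_prefix_len(tok))
    {
        case 8:                 /* </script */
            if (strncasecmp(tok, "</script", 8) == 0)
                ignore = false;
            break;
        case 7:                 /* <script || </style */
            if (strncasecmp(tok, "</style", 7) == 0)
                ignore = false;
            else if (strncasecmp(tok, "<script", 7) == 0)
                ignore = true;
            break;
        case 6:                 /* <style */
            if (strncasecmp(tok, "<style", 6) == 0)
                ignore = true;
            break;
        default:
            break;
    }
    return ignore;
}

/* Count the word tokens that are seen while the ignore flag is off.
 * Tokens starting with '<' are treated as tags; everything else is a word. */
static int
count_indexed_words(const char *const *tokens, size_t ntokens)
{
    bool ignore = false;
    int  kept = 0;

    for (size_t i = 0; i < ntokens; i++)
    {
        if (tokens[i][0] == '<')
            ignore = update_ignore(ignore, tokens[i]);
        else if (!ignore)
            kept++;
    }
    return kept;
}
```

In this sketch, the token list {"<script>", "banana", "</script>"} keeps no words, while {"<asdf>", "banana", "</asdf>"} keeps one, which lines up with the to_tsvector results in the question. Because the comparison is case-insensitive, <STYLE> toggles the flag the same way <style> does.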