postgresql

Why does to_tsvector ignore HTML script tags?


How does to_tsvector strip tags, and does it treat <script> tags specially?

SELECT to_tsvector('simple', '<textarea>banana</textarea>'); -- 'banana':1
SELECT to_tsvector('simple', '<script>banana</script>'); -- empty

Where is it documented?


Solution

  • Only <script> and <style> tags are affected. The default parser looks for these two specifically and reports whatever sits between the opening and closing tag as a blank ("Space symbols") token, so that text never reaches a dictionary. You can see this with ts_debug():
    demo at db<>fiddle

    select v
         , to_tsvector('simple', v)
         , ts_debug.*
    from(values('<script>banana</script>')
              ,('<style>banana</style>')
              ,('<asdf>banana</asdf>'))_(v)
    cross join lateral ts_debug(v)
    order by 1;
    
    v                       | to_tsvector | alias     | description     | token     | dictionaries   | dictionary   | lexemes
    ------------------------+-------------+-----------+-----------------+-----------+----------------+--------------+---------
    <asdf>banana</asdf>     | 'banana':1  | tag       | XML tag         | </asdf>   | {}             | null         | null
    <asdf>banana</asdf>     | 'banana':1  | asciiword | Word, all ASCII | banana    | {english_stem} | english_stem | {banana}
    <asdf>banana</asdf>     | 'banana':1  | tag       | XML tag         | <asdf>    | {}             | null         | null
    <script>banana</script> |             | tag       | XML tag         | <script>  | {}             | null         | null
    <script>banana</script> |             | blank     | Space symbols   | banana    | {}             | null         | null
    <script>banana</script> |             | tag       | XML tag         | </script> | {}             | null         | null
    <style>banana</style>   |             | tag       | XML tag         | </style>  | {}             | null         | null
    <style>banana</style>   |             | blank     | Space symbols   | banana    | {}             | null         | null
    <style>banana</style>   |             | tag       | XML tag         | <style>   | {}             | null         | null

    The special handling is hard-coded for exactly these two tags, as you can see in postgres/src/backend/tsearch/wparser_def.c:563:

    static void
    SpecialTags(TParser *prs)
    {
        switch (prs->state->lenchartoken)
        {
            case 8:                 /* </script */
                if (pg_strncasecmp(prs->token, "</script", 8) == 0)
                    prs->ignore = false;
                break;
            case 7:                 /* <script || </style */
                if (pg_strncasecmp(prs->token, "</style", 7) == 0)
                    prs->ignore = false;
                else if (pg_strncasecmp(prs->token, "<script", 7) == 0)
                    prs->ignore = true;
                break;
            case 6:                 /* <style */
                if (pg_strncasecmp(prs->token, "<style", 6) == 0)
                    prs->ignore = true;
                break;
            default:
                break;
        }
    }
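
    To make the effect of that switch easier to follow, here is a minimal standalone sketch of the same idea. It is not PostgreSQL's parser: ToyParser, toy_special_tags and toy_visible_text are hypothetical names invented for illustration, POSIX strncasecmp stands in for pg_strncasecmp, and the tag scanning is deliberately naive (any '<' starts a tag, even inside an ignored region). It only models how a single ignore flag, toggled by the opening and closing <script>/<style> tags, causes the text in between to be dropped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX), standing in for pg_strncasecmp */

/* Hypothetical stand-in for the parser: only the flag SpecialTags() touches. */
typedef struct {
    bool ignore;
} ToyParser;

/*
 * Same switch as SpecialTags(), but the tag prefix and its length are
 * passed in directly. `tag` points at '<'; `len` covers '<', an optional
 * '/', and the tag name, e.g. "<script" (7) or "</style" (7).
 */
static void toy_special_tags(ToyParser *prs, const char *tag, size_t len)
{
    switch (len) {
    case 8:                         /* </script */
        if (strncasecmp(tag, "</script", 8) == 0)
            prs->ignore = false;
        break;
    case 7:                         /* <script || </style */
        if (strncasecmp(tag, "</style", 7) == 0)
            prs->ignore = false;
        else if (strncasecmp(tag, "<script", 7) == 0)
            prs->ignore = true;
        break;
    case 6:                         /* <style */
        if (strncasecmp(tag, "<style", 6) == 0)
            prs->ignore = true;
        break;
    default:
        break;
    }
}

/*
 * Copy into `out` only the text that lies outside tags and outside an
 * ignored <script>/<style> region. Returns the number of bytes written,
 * not counting the terminating NUL. `out` must hold strlen(in) + 1 bytes.
 */
static size_t toy_visible_text(const char *in, char *out)
{
    ToyParser prs = { false };
    size_t n = 0;

    while (*in) {
        if (*in == '<') {
            const char *end = strchr(in, '>');
            size_t len = 1;         /* the '<' itself */

            if (end == NULL)
                break;              /* unterminated tag: the toy just stops */
            if (in[len] == '/')
                len++;              /* closing tag */
            while (in + len < end && in[len] != ' ' && in[len] != '/')
                len++;              /* tag name */
            toy_special_tags(&prs, in, len);
            in = end + 1;           /* skip past '>' */
        } else {
            if (!prs.ignore)
                out[n++] = *in;     /* visible text */
            in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

    Calling toy_visible_text("<script>banana</script>", buf) leaves buf empty, while toy_visible_text("<textarea>banana</textarea>", buf) copies "banana" into it, which mirrors the two queries in the question. Because the comparison is case-insensitive, "<STYLE>" toggles the flag in this sketch just as "<style>" does.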