I'm searching against a table of news articles. The two relevant columns are ArticleTitle and ArticleText. To search an article for a particular term, I started out with column LIKE '%term%'.
However, that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*">, which could return an irrelevant article.
So then I switched to column LIKE '% term %'.
The problem with this query is that it didn't find articles whose title or text began or ended with the term. It also didn't match things like term- or term's, which I do want.
It seems like the query I want should be able to do something like '%[^a-z]term[^a-z]%'. This should exclude terms within anchor links but match everything else. However, I think this query still excludes strings that begin or end with the term. Is there a better solution? Does SQL Server's full-text indexing solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then I could use '%term%' without matching anchor links. These would be two extra columns, though, because eventually I will need the original HTML for formatting purposes.
Thanks.
SQL Server's LIKE supports regex-like character classes such as [^a-z], so the pattern you described is valid.
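To also catch a term at the very start or end of a column with LIKE alone, one option is to pad the column with a space on each side so that the [^a-z] class always has a character to match. The following is only a sketch: the table name dbo.Articles and the @term variable are hypothetical, and it assumes the columns are nvarchar (not text/ntext) under a case-insensitive collation.

```sql
-- Hypothetical table name: dbo.Articles. Column names come from the question.
-- @term is an illustrative variable; LIKE wildcards inside it (%, _, [)
-- are not escaped here.
DECLARE @term nvarchar(100) = N'term';

SELECT ArticleTitle
FROM dbo.Articles
WHERE N' ' + ArticleTitle + N' ' LIKE N'%[^a-z]' + @term + N'[^a-z]%'
   OR N' ' + ArticleText  + N' ' LIKE N'%[^a-z]' + @term + N'[^a-z]%';
```

This still scans every row, and it will still match a term inside a URL when it is delimited by punctuation (for example /term"), because / and " are not in the a-z range.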
A better option is to use full-text search: WHERE CONTAINS(ArticleTitle, 'term') uses a full-text index instead of scanning every row (the LIKE '%term%' query is slow because its leading wildcard rules out an index seek), and it matches whole words rather than arbitrary substrings.
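CONTAINS only works once a full-text index exists on the column. A minimal setup might look like the sketch below. The table name dbo.Articles, the catalog name ArticleCatalog, and the key index name PK_Articles are all assumptions; the key index has to be a unique, single-column, non-nullable index on the table.

```sql
-- All object names except the two column names are hypothetical.
CREATE FULLTEXT CATALOG ArticleCatalog;

-- PK_Articles is assumed to be a unique, single-column, non-nullable index.
CREATE FULLTEXT INDEX ON dbo.Articles (ArticleTitle, ArticleText)
    KEY INDEX PK_Articles
    ON ArticleCatalog;

-- Word-based match in either column.
SELECT ArticleTitle
FROM dbo.Articles
WHERE CONTAINS((ArticleTitle, ArticleText), N'term');
```

The index is populated in the background, so a query run immediately after creating it may return no rows until population finishes. Also, as far as I know, markup stored in a plain character column is indexed as ordinary text, so words inside an href can still produce hits.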
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version and running your search queries against it.