Search code examples
sql.netpostgresql.net-6.0sanitization

Sanitizing and preparation of text search string for dapper query


We have a table of following structure in postgresql (simplified):

CREATE TABLE items(
    id bigint NOT NULL DEFAULT nextval('item_id_seq') PRIMARY KEY,
    name varchar(40) not null,
    name_search tsvector GENERATED ALWAYS AS (to_tsvector('simple', name)) STORED
);

Using C# as the programming language I would like to make a full text search on name_search field. Without full ORM like EntifyFramework.

Dapper is used and the current query/logic looks like this:

    public async Task<IEnumerable<FeedItemInDb>> GetItems(string searchTerm)
    {
        string searchFilter = "";
        if (!string.IsNullOrEmpty(searchTerm))
        {
            string searchTermProcessed = $"{searchTerm}:*";
            searchTermProcessed = searchTermProcessed.Replace(" ", " & ");

            searchFilter = $"AND i.name_search @@ to_tsquery('simple', '{searchTermProcessed}')";
        }

        var results = await uow.Connection.QueryAsync<FeedItemInDb>($@"select i.* FROM items i WHERE 1=1 {searchFilter}", new
        {
            // params here
        });

        return results;
    }

It works OK for very trivial cases. For example search string my test is sanitized to my & test:*

There is one main flaw with this approach - you have to know in advance all the sanitization rules required for query! As an example, following raw input my :*test ends up with exception:

Npgsql.PostgresException: 42601: syntax error in tsquery: "my & :test:"

I was wondering if there is some kind of package or rules or code in Dapper that would do all of the required sanitization work for me? Similarly to how we can parametrize other values in query...

Normally, using Dapper, I would expect the code to look like this:

    public async Task<IEnumerable<FeedItemInDb>> GetItems(string searchTerm)
    {

        var results = await uow.Connection.QueryAsync<FeedItemInDb>($@"select i.* FROM items i
    WHERE i.name_search @@ to_tsquery('simple', '@SearchParam:*')", new
        {
            SearchParam = searchTerm
        });

        return results;
    }

But it does not return any results, unfortunately. Nor with to_tsquery('simple', '@SearchParam').

In general I just want to know how to solve this problem. How to sanitize the string for the full text search using Dapper. If the user starts passing :,.%&* in a query I would expect my code to start failing if left as is.. Should I maybe prohibit/filter out all special characters from user input? The problem is I have no idea what I should be filtering out..

EDIT: to_tsquery('simple', @SearchParam) does seem to work, if I format the search string manually before the search.. So essentially I have the same problem as before. If the string is not properly formatted then sql exception is thrown.. Therefore you have to know and apply in advance all formatting/sanitization rules so the query would not fail. So I still have the same question on how to deal with this situation.


Solution

  • While there is no answer, it found it "somewhat doable" to just replace non-supported characters with a regular expression like so:

        private static string CorrectInputString(string input)
        {
            string result = input?.Trim();
    
            if (!string.IsNullOrEmpty(result))
            {
                // remove newlines and tabs
                result = Regex.Replace(result, @"\t|\n|\r", "");
    
                // remove not-supported characters (supported are: numbers, regular letters, hyphens, spaces)
                result = Regex.Replace(result, "[^\p{L}0-9- ]", "");
    
                // remove double spaces (also trims)
                result = string.Join(" ", result.Split(' ', StringSplitOptions.RemoveEmptyEntries));
            }
    
            return result;
        }
    

    then

        public static string PrepareSearchString(string input)
        {
            string result = input?.Trim();
    
            if (!string.IsNullOrEmpty(result))
            {
                // prepare query #1
                result = $"{result}:*";
    
                // prepare query #2
                result = result.Replace(" ", "&");
            }
    
            return result;
        }
    

    I use 2 of these methods in sequence in a helper class.

    This of course removes quite a lot of information from the initial input string, exactly what I wanted to avoid doing.. There seems to be another solution for full-text search using GIST and GIN indexes explained here (Postgresql prefix wildcard for full text). It might work better for the given use-case.

    I have decided to leave the solution as is (if there is no better answer).

    p.s. here is the unit-test that covers all kinds of transformations:

        [TestMethod]
        public async Task Sanitize_DisplayName()
        {
            var sourceExpected = new List<(string source, string expected)>()
            {
                ("1", "1"),
                ("test", "test"),
                ("test1", "test1"),
                ("test 1", "test 1"),
                ("test-1", "test-1"),
                ("test-test", "test-test"),
                ("test-test test", "test-test test"),
                ("1 test-TEST test", "1 test-TEST test"),
                ("тест", "тест"),
                ("  тест", "тест"),
                (@"
            тест", "тест"), // ENTER here!
                ("test*/!#%^()[]{}", "test"),
                ("te st*/", "te st"),
                ("te  st*/", "te st"),
                ("te   st*/", "te st"),
                ("\t te    st */", "te st"),
                (" te     st */", "te st"),
                ("test ? &", "test"),
                ("test ? &-", "test -"), // !
            };
    
            foreach (var item in sourceExpected)
            {
                var res = SqlHelper.CorrectInputString(item.source);
    
                Assert.AreEqual(item.expected, res);
            }
        }