Search code examples
sqlmysqlfull-text-search

MySQL match against giving strange results


I'm using the following SQL:

SELECT id, cleanDescription, (MATCH (cleanDescription) AGAINST ('amazon com' IN BOOLEAN MODE)) AS score FROM Foo

And it's giving me following result set:

      [
        {
            "id": "1",
            "cleanDescription": "amazon prime pmts",
            "score": 4.788609504699707
        },
        {
            "id": "2",
            "cleanDescription": "amazon prime",
            "score": 4.788609504699707
        },
        {
            "id": "3",
            "cleanDescription": "amazon com",
            "score": 4.788609504699707
        },
        {
            "id": "4",
            "cleanDescription": "amazon com amzn com bi",
            "score": 4.788609504699707
        },
        {
            "id": "5",
            "cleanDescription": "amazon com amzn com bi",
            "score": 4.788609504699707
        }
    ]

It seems odd to me that it doesn't score id: 3 as the highest, seeing as it's an exact match for the string

Am I missing something obvious here?


Solution

  • In the given SQL query, you are using the MATCH AGAINST function with the IN BOOLEAN MODE modifier to perform a full-text search. The full-text search algorithm used in MySQL, called the Boolean Full-Text Search, ranks the search results based on relevance.

    In your specific case, it seems odd that the exact match "amazon com" is not scoring the highest. This behavior could be due to several factors:

    Word Length: By default, MySQL ignores words that are too short (less than four characters) in the full-text search. If "amazon" is considered a stop word, it might be ignored in the search, resulting in similar scores for different rows.

    Stop Words: MySQL has a predefined list of stop words that are ignored in full-text searches. If "com" is considered a stop word, it would also be ignored, resulting in similar scores for different rows.

    Relevance Algorithm: The ranking of search results is determined by the relevance algorithm used by MySQL. The algorithm takes into account factors such as word frequency, word proximity, and the inverse document frequency. If the search term "amazon com" appears frequently in other rows as well, it may not receive a significantly higher score.

    To investigate further, you can check the MySQL server configuration to see if any stop words are affecting the search results. Additionally, you can experiment with different search terms and observe the scores to gain a better understanding of how the relevance algorithm is working in your specific case.

    Here are a couple of examples to illustrate how the relevance algorithm in MySQL can be affected by stop words:

    Example 1: Consider the following table Products:

    id description
    1 Amazon Echo
    2 Amazon Fire TV Stick
    3 Echo Dot
    4 TV Stick
    5 Amazon Prime Membership

    Let's execute the following SQL query:

    SELECT id, description, MATCH (description) AGAINST ('Amazon' IN BOOLEAN MODE) AS score
    FROM Products;
    

    The result set will be:

    id description score
    1 Amazon Echo 0.8082903623580933
    2 Amazon Fire TV Stick 0.8082903623580933
    3 Echo Dot 0.0
    4 TV Stick 0.0
    5 Amazon Prime Membership 0.8082903623580933

    In this example, the word "Amazon" appears in multiple rows, but the scores are the same for all rows containing "Amazon." This behavior occurs because "Amazon" is considered a stop word by default in MySQL, so it is ignored during the search. As a result, the relevance score remains the same for all rows, regardless of the presence of "Amazon."

    Example 2: Let's consider a scenario where we have a table Posts with the following data:

    id title
    1 The importance of learning
    2 The art of painting
    3 Importance of continuous learning
    4 Artistic expression through painting
    5 The learning curve in painting

    Executing the following SQL query:

    SELECT id, title, MATCH (title) AGAINST ('The' IN BOOLEAN MODE) AS score
    FROM Posts;
    

    The result set will be:

    id title score
    1 The importance of learning 0.1150694864397049
    2 The art of painting 0.1150694864397049
    3 Importance of continuous learning 0.0
    4 Artistic expression through painting 0.0
    5 The learning curve in painting 0.1150694864397049

    In this example, the word "The" is a common stop word in MySQL. As a result, it is ignored during the search, and the rows containing "The" receive the same score, regardless of the specific context or relevance of the word in the title.

    These examples highlight how stop words can impact the relevance algorithm in MySQL and potentially affect the scoring of search results.