First of all, I'll preface this by saying I have minimal experience with SQL.
Anyways, I have a table in MemSQL of the following format:
+-----+--------+----------+----------+
| id | uuid | identity | template |
+-----+--------+----------+----------+
| int | string | string | blob |
+-----+--------+----------+----------+
I am trying to use the MemSQL DOT_PRODUCT
feature to obtain the identity
of the template which generates the maximum dot product against a probe vector that I have provided. Note template is a normalized array of floats of fixed length.
My SQL statement is as follows:
SELECT id, identity, MAX(DOT_PRODUCT(template, JSON_ARRAY_PACK('[<probe template here>]')))
AS score FROM collection;
However, I seem to be experiencing strange behavior, where I am getting inconsistent results (1 out of 10 times that I execute the query I will get a different identity but always the same max score). Additionally, the identity
is incorrect (see further below).
The result from the query is the following (9 out of 10 times):
+----+-------------+------------------+
| id | identity | score |
+----+-------------+------------------+
| 7 | armstrong_2 | 0.56488848850131 |
+----+-------------+------------------+
As a sanity check, I wrote the following SQL statement, expecting for the max to be the same. Note, I am using the exact same probe vector from before:
SELECT id, identity, DOT_PRODUCT(template, JSON_ARRAY_PACK('[<same probe template from before>]'))
AS score FROM collection ORDER BY score DESC;
The results are as follows:
+----+--------------+--------------------+
| id | identity | score |
+----+--------------+--------------------+
| 1 | armstrong_1 | 0.56488848850131 |
| 21 | armstrong_1 | 0.56488848850131 |
| 6 | armstrong_1 | 0.56488848850131 |
| 11 | armstrong_1 | 0.56488848850131 |
| 16 | armstrong_1 | 0.56488848850131 |
| 17 | armstrong_2 | 0.534708674997091 |
| 7 | armstrong_2 | 0.534708674997091 |
| 22 | armstrong_2 | 0.534708674997091 |
| 2 | armstrong_2 | 0.534708674997091 |
| 12 | armstrong_2 | 0.534708674997091 |
| 10 | mr_bean_2 | 0.072085081599653 |
| 15 | mr_bean_2 | 0.072085081599653 |
| 5 | mr_bean_2 | 0.072085081599653 |
| 20 | mr_bean_2 | 0.072085081599653 |
| 25 | mr_bean_2 | 0.072085081599653 |
| 14 | mr_bean | 0.037121964152902 |
| 9 | mr_bean | 0.037121964152902 |
| 4 | mr_bean | 0.037121964152902 |
| 19 | mr_bean | 0.037121964152902 |
| 24 | mr_bean | 0.037121964152902 |
| 13 | jimmy_carter | -0.011749440804124 |
| 23 | jimmy_carter | -0.011749440804124 |
| 18 | jimmy_carter | -0.011749440804124 |
| 8 | jimmy_carter | -0.011749440804124 |
| 3 | jimmy_carter | -0.011749440804124 |
+----+--------------+--------------------+
What is going on? Why is the MAX
identity from the first query not the same as the max identity (top row) for the second query? Is one / both of my query statements incorrect?
Additionally, when I compute the dot product by hand (without any SQL or MemSQL), I find that armstrong_1
does indeed produce the highest score of 0.56488848850131
. So why is my first SQL query (with the MAX
operator) not working?
This is simply not valid SQL:
SELECT id, identity, MAX(DOT_PRODUCT(template, JSON_ARRAY_PACK('[<probe template here>]'))) AS score
FROM collection;
You have no GROUP BY
but the query is an aggregation function (due to the MAX()
. Then there are two other columns. This is not valid SQL and it is sad that some databases allow it.
The best approach is ORDER BY
:
SELECT id, identity, DOT_PRODUCT(template, JSON_ARRAY_PACK('[<probe template here>]')) AS score
FROM collection
ORDER BY score DESC
LIMIT 1; -- or whatever your database uses to limit to one row