I am an amateur user of SQL and I want to construct a quite advances SQL query for detecting the most similar object within a database in Access
More precisely my database has the following strucure
FID PC1 PC2 PC3 PC4
1 0.765 0.043 -0.0023 -0.0009
5 0.223 -0.09 -0.0045 -0.0002
300
...
...
...
In this structure I would like detect per object (FID) which other (FID) is most similar to this one by considering all 4 PC attributes values (PC1,PC2,PC3,PC4).
I can think a mathematical solution for this problem but I am not sure how to implement it in an SQL query. Can somebody advice me on this ?
Thanks in advance
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The exact query that I am running looks like this:
SELECT
Merged_Info.TARGET_FID,
Merged_Info.PC1_MEAN,
Merged_Info.PC2_Mean,
Merged_Info.PC3_MEAN,
Merged_Info.PC4_MEAN,
([PC1_MEAN]+0.7*[PC2_Mean]+0.4*[PC3_MEAN]+0.2*[PC4_MEAN])/4 AS Rating
FROM Merged_Info
SELECT
q1.TARGET_FID,
(
SELECT TOP 1 q2.TARGET_FID
FROM myRatings AS q2
WHERE q2.TARGET_FID<>q1.TARGET_FID
ORDER BY Abs(q2.Rating-q1.Rating), q2.TARGET_FID
) AS Closest_TARGET_FID
FROM myRatings AS q1
Say we have sample data in a table named [myData]:
FID PC1 PC2 PC3 PC4
--- --- --- --- ---
1 1 3 5 2
2 4 4 4 0
3 5 3 1 1
4 9 9 8 7
We use some formula to give each row a "rating" (or "score") based on the values of [PC1] through [PC4]. For simplicity, we'll use the average value. So we create a saved query in Access named [myRatings] ...
SELECT
myData.FID,
myData.PC1,
myData.PC2,
myData.PC3,
myData.PC4,
([PC1]+[PC2]+[PC3]+[PC4])/4 AS Rating
FROM myData
... which returns ...
FID PC1 PC2 PC3 PC4 Rating
--- --- --- --- --- ------
1 1 3 5 2 2.75
2 4 4 4 0 3
3 5 3 1 1 2.5
4 9 9 8 7 8.25
Now we can use that query as the basis for another query which finds the FID with the closest rating
SELECT
q1.FID,
(
SELECT TOP 1 q2.FID
FROM myRatings AS q2
WHERE q2.FID<>q1.FID
ORDER BY Abs(q2.Rating-q1.Rating), q2.FID
) AS Closest_FID
FROM myRatings AS q1
which returns
FID Closest_FID
--- -----------
1 2
2 1
3 1
4 2
Notice that the ORDER BY clause of the subquery includes q2.FID as a "tie breaker". Without it the query would fail when it processed the row with FID=1 because both FID=2 and FID=3 have ratings that are 0.25 away from the rating for FID=1. In that case the subquery would return 2 rows despite the TOP 1
clause. (It's an "Access thing".) Adding the second ORDER BY condition ensures that the subquery will always return just one row.