Search code examples
sqlms-accessdetectsimilarity

Find the row with the next closest rating (score) for each row in a table


I am an amateur user of SQL and I want to construct a quite advances SQL query for detecting the most similar object within a database in Access

More precisely my database has the following strucure

                        FID     PC1       PC2      PC3        PC4
                         1      0.765     0.043    -0.0023    -0.0009
                         5      0.223     -0.09    -0.0045    -0.0002  
                        300
                        ...
                        ...
                        ...

In this structure I would like detect per object (FID) which other (FID) is most similar to this one by considering all 4 PC attributes values (PC1,PC2,PC3,PC4).

I can think a mathematical solution for this problem but I am not sure how to implement it in an SQL query. Can somebody advice me on this ?

Thanks in advance

~~~~~~~~~~~~~~~~~~~~~~~~~~~

The exact query that I am running looks like this:


 SELECT 
     Merged_Info.TARGET_FID, 
     Merged_Info.PC1_MEAN, 
     Merged_Info.PC2_Mean, 
     Merged_Info.PC3_MEAN, 
     Merged_Info.PC4_MEAN, 
     ([PC1_MEAN]+0.7*[PC2_Mean]+0.4*[PC3_MEAN]+0.2*[PC4_MEAN])/4 AS Rating
 FROM Merged_Info


SELECT
  q1.TARGET_FID,
  (
      SELECT TOP 1 q2.TARGET_FID 
      FROM myRatings  AS q2 
      WHERE q2.TARGET_FID<>q1.TARGET_FID 
      ORDER BY Abs(q2.Rating-q1.Rating), q2.TARGET_FID
  ) AS Closest_TARGET_FID
FROM myRatings  AS q1

Solution

  • Say we have sample data in a table named [myData]:

    FID  PC1  PC2  PC3  PC4
    ---  ---  ---  ---  ---
      1    1    3    5    2
      2    4    4    4    0
      3    5    3    1    1
      4    9    9    8    7
    

    We use some formula to give each row a "rating" (or "score") based on the values of [PC1] through [PC4]. For simplicity, we'll use the average value. So we create a saved query in Access named [myRatings] ...

    SELECT 
        myData.FID, 
        myData.PC1, 
        myData.PC2, 
        myData.PC3, 
        myData.PC4, 
        ([PC1]+[PC2]+[PC3]+[PC4])/4 AS Rating
    FROM myData
    

    ... which returns ...

    FID  PC1  PC2  PC3  PC4  Rating
    ---  ---  ---  ---  ---  ------
      1    1    3    5    2    2.75
      2    4    4    4    0       3
      3    5    3    1    1     2.5
      4    9    9    8    7    8.25
    

    Now we can use that query as the basis for another query which finds the FID with the closest rating

    SELECT 
        q1.FID,
        (
            SELECT TOP 1 q2.FID 
            FROM myRatings AS q2 
            WHERE q2.FID<>q1.FID 
            ORDER BY Abs(q2.Rating-q1.Rating), q2.FID
        ) AS Closest_FID
    FROM myRatings AS q1
    

    which returns

    FID  Closest_FID
    ---  -----------
      1            2
      2            1
      3            1
      4            2
    

    Notice that the ORDER BY clause of the subquery includes q2.FID as a "tie breaker". Without it the query would fail when it processed the row with FID=1 because both FID=2 and FID=3 have ratings that are 0.25 away from the rating for FID=1. In that case the subquery would return 2 rows despite the TOP 1 clause. (It's an "Access thing".) Adding the second ORDER BY condition ensures that the subquery will always return just one row.