Search code examples
algorithmlanguage-agnosticmatchingsimilarity

Find similar records in dataset


I have a dataset of 25 integer fields and 40k records, e.g.

1:
  field1: 0
  field2: 3
  field3: 1
  field4: 2
  [...]
  field25: 1
2:
  field1: 2
  field2: 1
  field3: 4
  field4: 0
  [...]
  field25: 2

etc.

I'm testing with MySQL but am not tied to it.

Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.


Solution

  • Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):

    SELECT id,
    (
      ABS(field1-2)
      + ABS(field2-2)
      + ABS(field3-3)
      + ABS(field4-1)
      + ABS(field5-0)
      + ABS(field6-3)
      + ABS(field7-2)
      + ABS(field8-0)
      + ABS(field9-1)
      + ABS(field10-0)
      + ABS(field11-2)
      + ABS(field12-2)
      + ABS(field13-3)
      + ABS(field14-2)
      + ABS(field15-0)
      + ABS(field16-1)
      + ABS(field17-0)
      + ABS(field18-2)
      + ABS(field19-3)
      + ABS(field20-1)
      + ABS(field21-0)
      + ABS(field22-1)
      + ABS(field23-3)
      + ABS(field24-2)
      + ABS(field25-2)
    )/25
    AS distance 
    FROM mytable
    ORDER BY distance ASC
    LIMIT 20;