I am designing automated trading software for the foreign exchange market. In a MySQL database I have years of market data at five-minute intervals. I have 4 different metrics for this data alongside the price and time.
[Time|Price|M1|M2|M3|M4]
x ~400,000 rows
Time is the primary key, and M1 through M4 are different metrics (such as standard deviation or slope of a moving average).
Here is a real example (excerpt):
+------------+--------+-----------+--------+-----------+-----------+
| Time | Price | M1 | M2 | M3 | M4 |
+------------+--------+-----------+--------+-----------+-----------+
| 1105410300 | 1.3101 | 12.9132 | 0.4647 | 29.6703 | 50 |
| 1105410600 | 1.3103 | 14.056 | 0.5305 | 29.230801 | 50 |
| 1105410900 | 1.3105 | 15.3613 | 0.5722 | 26.8132 | 25 |
| 1105411200 | 1.3106 | 16.627501 | 0.4433 | 24.395599 | 26.47059 |
| 1105411500 | 1.3112 | 18.7843 | 1.0019 | 24.505501 | 34.375 |
| 1105411800 | 1.3111 | 19.8375 | 0.5626 | 20 | 32.8125 |
| 1105412100 | 1.3105 | 20.0168 | 0.6718 | 9.7802 | 23.4375 |
| 1105412400 | 1.3105 | 20.4538 | 0.8943 | 7.033 | 23.4375 |
| 1105412700 | 1.3109 | 21.6078 | 0.4902 | 11.7582 | 29.6875 |
| 1105413000 | 1.3104 | 21.2045 | 1.565 | 8.6813 | 21.875 |
+------------+--------+-----------+--------+-----------+-----------+...400k more
Given an input of M1, M2, M3, and M4, I want to (quickly and accurately) find the 5,000 closest matches.
Sample input:
+------------+--------+-----------+--------+-----------+-----------+
| Time | Price | M1 | M2 | M3 | M4 |
+------------+--------+-----------+--------+-----------+-----------+
| 1205413000 | 1.4212 | 20.1045 | 1.0012 | 9.1013 | 11.575 |
+------------+--------+-----------+--------+-----------+-----------+
I figured that each of these metrics could be considered a 'dimension,' and that I can do a nearest neighbor search to locate the closest datapoints in this multidimensional space.
It seems the simplest way to do this is to iterate through every single data point and measure the multidimensional distance to my input point; but speed is of the essence!
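For reference, the brute-force approach described above can be sketched in Python as follows. This is only an illustration under assumptions: the rows are assumed to be already loaded into memory as tuples in the table's column order, the function name `brute_force_nearest` is hypothetical, and the distance is plain Euclidean distance on the raw (unscaled) metric values.

```python
import heapq
import math

def brute_force_nearest(rows, query, k):
    """Return the k rows closest to query, scanning every row once.

    rows  : tuples of (Time, Price, M1, M2, M3, M4)
    query : tuple of (M1, M2, M3, M4)
    """
    def dist(row):
        # Euclidean distance over the four metric columns only.
        return math.dist(row[2:6], query)
    return heapq.nsmallest(k, rows, key=dist)

# A few rows taken from the excerpt above.
rows = [
    (1105410300, 1.3101, 12.9132, 0.4647, 29.6703, 50.0),
    (1105412100, 1.3105, 20.0168, 0.6718, 9.7802, 23.4375),
    (1105412400, 1.3105, 20.4538, 0.8943, 7.033, 23.4375),
    (1105413000, 1.3104, 21.2045, 1.565, 8.6813, 21.875),
]
query = (20.1045, 1.0012, 9.1013, 11.575)  # the sample input
nearest = brute_force_nearest(rows, query, k=2)
```

Every query touches all rows, which is the cost I would like to avoid.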
I read about something called K-D Trees used for this purpose. Can anyone please explain or provide me with some material that explains how to implement this in MySQL?
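To make the idea concrete, here is a minimal k-d tree sketch in Python. It runs outside the database and assumes the metric columns fit in application memory; the names `build_kdtree` and `knn` are hypothetical, and this is not a MySQL feature.

```python
import heapq

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree; each node is (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # split at the median
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def knn(node, query, k, depth=0, heap=None):
    """Collect the k nearest points in a max-heap of (-squared_distance, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(point, query))
    if len(heap) < k:
        heapq.heappush(heap, (-d2, point))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, point))
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn(near, query, k, depth + 1, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th best match (or if fewer than k matches were found so far).
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, query, k, depth + 1, heap)
    return heap

# (M1, M2, M3, M4) tuples taken from the excerpt above.
points = [
    (12.9132, 0.4647, 29.6703, 50.0),
    (18.7843, 1.0019, 24.505501, 34.375),
    (20.0168, 0.6718, 9.7802, 23.4375),
    (20.4538, 0.8943, 7.033, 23.4375),
    (21.2045, 1.565, 8.6813, 21.875),
]
query = (20.1045, 1.0012, 9.1013, 11.575)
tree = build_kdtree(points)
# Sort by ascending squared distance for readability.
result = sorted((-neg_d2, p) for neg_d2, p in knn(tree, query, k=2))
```

The tree is built once during pre-processing; each real-time query then skips whole branches instead of scanning every row.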
It may be relevant to mention that I can pre-process the table, but the input is received in real-time.
Currently I just make a rough cluster around the data on each dimension independently:
INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1) WHERE myTable.M1 < currentM1 ORDER BY M1 DESC LIMIT 2500;
INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1) WHERE myTable.M1 > currentM1 ORDER BY M1 ASC LIMIT 2500;
INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2) WHERE myTable.M2 < currentM2 ORDER BY M2 DESC LIMIT 2500;
INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2) WHERE myTable.M2 > currentM2 ORDER BY M2 ASC LIMIT 2500;
INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3) WHERE myTable.M3 < currentM3 ORDER BY M3 DESC LIMIT 2500;
INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3) WHERE myTable.M3 > currentM3 ORDER BY M3 ASC LIMIT 2500;
INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4) WHERE myTable.M4 < currentM4 ORDER BY M4 DESC LIMIT 2500;
INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4) WHERE myTable.M4 > currentM4 ORDER BY M4 ASC LIMIT 2500;
It is important to understand that I am interested in distance by rank, not by value.
Edit: I am a little closer to understanding how to do it (I think):
I need to pre-process each row of each metric and assign it a percentile, which would represent its location (percent-wise) in its range.
For example, for any given value of M1:
percentile = (# rows with values less than input)/(# total rows)
If I calculate the input's percentile and use that for a nearest neighbor search instead of the actual value, I will have effectively scaled the various metrics so that they can be used as comparable dimensions.
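A small Python sketch of that percentile formula, assuming the values of one metric have been pre-sorted once during pre-processing (the function name `percentile` and the variable `m1_sorted` are hypothetical):

```python
from bisect import bisect_left

def percentile(sorted_values, x):
    """Fraction of rows whose value is strictly less than x."""
    # bisect_left returns the number of elements strictly less than x.
    return bisect_left(sorted_values, x) / len(sorted_values)

# The M1 column from the excerpt above, sorted once up front.
m1_sorted = sorted([12.9132, 14.056, 15.3613, 16.627501, 18.7843,
                    19.8375, 20.0168, 20.4538, 21.6078, 21.2045])
p = percentile(m1_sorted, 20.1045)  # 7 of the 10 values are smaller
```

Because the lookup is a binary search, converting a real-time input to its percentile costs O(log n) per metric.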
I am still lost on how to do the actual search though. Is this even possible to accomplish efficiently in MySQL?
You should be able to do a query like the following:
SELECT * FROM myTable
WHERE M1 BETWEEN searchM1 - radiusM1 AND searchM1 + radiusM1
AND M2 BETWEEN searchM2 - radiusM2 AND searchM2 + radiusM2
AND M3 BETWEEN searchM3 - radiusM3 AND searchM3 + radiusM3
AND M4 BETWEEN searchM4 - radiusM4 AND searchM4 + radiusM4
In the case of a sphere, all the radius values will be the same, of course. You then adjust the radius until you get as close as possible to the number of records you want. I'd suggest a binary search.
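The radius adjustment could look roughly like this in Python. It is a sketch under assumptions: `count_in_box` is a hypothetical in-memory stand-in for running the query above as a `COUNT(*)`, a single radius is used for all four metrics, and the initial upper bound `hi` is assumed to be large enough to contain the target number of rows.

```python
def count_in_box(rows, center, radius):
    """Stand-in for SELECT COUNT(*) with the four BETWEEN conditions."""
    # BETWEEN is inclusive on both ends, hence <=.
    return sum(all(abs(v - c) <= radius for v, c in zip(row, center))
               for row in rows)

def find_radius(rows, center, target, lo=0.0, hi=100.0, iterations=40):
    """Binary-search the smallest radius whose box holds at least target rows."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if count_in_box(rows, center, mid) >= target:
            hi = mid   # enough rows: try a smaller radius
        else:
            lo = mid   # too few rows: grow the radius
    return hi

# Toy data: ten points along the diagonal of a 4-D space.
rows = [(float(i),) * 4 for i in range(10)]
center = (0.0, 0.0, 0.0, 0.0)
radius = find_radius(rows, center, target=3)  # converges towards 2.0
```

Note that the BETWEEN conditions select a box rather than a true sphere, so if exact distances matter, the returned rows can still be sorted by distance afterwards.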
I'm not sure if you want to mess with the distribution or not, but assuming you do, you would just need to give each search value a rank between the ranks of the two values it would fall between in your table (e.g. if rank 5 is 5.5, rank 6 is 5.9, and the search value is 5.6, then the search rank could be 5.5).
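As a rough Python illustration of that rank assignment (the function name `search_rank` is hypothetical, ranks are assumed to be 1-based, and apart from 5.5 and 5.9 the sample values are made up):

```python
from bisect import bisect_left

def search_rank(sorted_values, x):
    """Rank of x among sorted_values, using a half rank between neighbours."""
    i = bisect_left(sorted_values, x)  # number of values strictly less than x
    if i < len(sorted_values) and sorted_values[i] == x:
        return float(i + 1)            # exact match: reuse the existing rank
    return i + 0.5                     # halfway between rank i and rank i + 1

# Rank 5 is 5.5 and rank 6 is 5.9, as in the example above.
values = [1.0, 2.0, 3.0, 4.0, 5.5, 5.9, 7.0]
rank = search_rank(values, 5.6)
```

The search ranks can then be used in place of the raw values in the BETWEEN query, with the radius expressed in ranks.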