Can someone help make the following query joins instead of sub-selects? It is from this tutorial: http://www.plumislandmedia.net/mysql/haversine-mysql-nearest-loc/.
Implementing this turns out to be super slow with a large number of rows (4 million). I'm thinking that the sub-selects are the root cause, but I can't figure out how to make it into joins.
SELECT
zip,
primary_city,
latitude,
longitude,
distance
FROM
(
SELECT
z.zip,
z.primary_city,
z.latitude,
z.longitude,
p.radius,
p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) AS distance
FROM zip AS z
JOIN
(
/* these are the query parameters */
SELECT
42.81 AS latpoint,
-70.81 AS longpoint,
50.0 AS radius,
111.045 AS distance_unit
) AS p ON 1 = 1
WHERE
z.latitude BETWEEN p.latpoint - (p.radius / p.distance_unit)
AND p.latpoint + (p.radius / p.distance_unit)
AND z.longitude BETWEEN p.longpoint - (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
AND p.longpoint + (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
) AS d
WHERE distance <= radius
ORDER BY distance
LIMIT 15
I don't believe there is anything you can do here. I imagine that your long running time is coming from the extraordinary amount of CPU necessary to process 4million records through all these calculations.
Your innermost subquery is just 4 constants that are cross joined to your main subquery, so there isn't anything you can do here to help speed it up. It would be a wash.
Your main subquery (the big SELECT statement) is doing all the work here and is wrapped in the main query to save processing, since distance would need to be calculated three times, unless mysql's optimizer performed some kind of miracle and recognized that the calculation was used three times.
At any rate, here is probably a worse performing version that removes the outer most query:
SELECT
z.zip,
z.primary_city,
z.latitude,
z.longitude,
p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) AS distance
FROM zip AS z
JOIN
(
/* these are the query parameters */
SELECT
42.81 AS latpoint,
-70.81 AS longpoint,
50.0 AS radius,
111.045 AS distance_unit
) AS p ON 1 = 1
WHERE
z.latitude BETWEEN p.latpoint - (p.radius / p.distance_unit)
AND p.latpoint + (p.radius / p.distance_unit)
AND z.longitude BETWEEN p.longpoint - (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
AND p.longpoint + (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
WHERE p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) <= p.radius
ORDER BY p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) LIMIT 15
You can see that all instances of z.distance
were replaced with the formula that is used to calculate distance in the WHERE
and the ORDER BY
parts of the query.
If you wanted to do away with the Cross Joined subquery that is holding the constants... you could do that too, but now you are losing performance with the last change, and losing readability with the loss of the Cross Join:
SELECT
z.zip,
z.primary_city,
z.latitude,
z.longitude,
111.045 * DEGREES(ACOS(COS(RADIANS(42.81)) * COS(RADIANS(z.latitude)) * COS(RADIANS(-70.81 - z.longitude)) + SIN(RADIANS(42.81)) * SIN(RADIANS(z.latitude)))) AS distance
FROM zip AS z
WHERE
z.latitude BETWEEN 42.81 - (50.0 / 111.045)
AND 42.81 + (50.0 / 111.045)
AND z.longitude BETWEEN -70.81 - (50.0 / (111.045 * COS(RADIANS(42.81))))
AND -70.81 + (50.0 / (111.045 * COS(RADIANS(42.81))))
WHERE 111.045 * DEGREES(ACOS(COS(RADIANS(42.81)) * COS(RADIANS(z.latitude)) * COS(RADIANS(-70.81 - z.longitude)) + SIN(RADIANS(42.81)) * SIN(RADIANS(z.latitude)))) <= 50.0
ORDER BY 111.045 * DEGREES(ACOS(COS(RADIANS(42.81)) * COS(RADIANS(z.latitude)) * COS(RADIANS(-70.81 - z.longitude)) + SIN(RADIANS(42.81)) * SIN(RADIANS(z.latitude)))) LIMIT 15
So... in the end, it's an interesting exercise, but I don't think there are any pros, and there are definitely some cons to these changes.