Search code examples
phpmysqlhaversine

Fast nearest-location finder for SQL (MySQL, PostgreSQL, SQL Server)


Can someone help make the following query joins instead of sub-selects? It is from this tutorial: http://www.plumislandmedia.net/mysql/haversine-mysql-nearest-loc/.

Implementing this turns out to be super slow with a large number of rows (4 million). I'm thinking that the sub-selects are the root cause, but I can't figure out how to make it into joins.

SELECT 
    zip,
    primary_city,
    latitude,
    longitude,
    distance
FROM 
    (
        SELECT 
            z.zip,
            z.primary_city,
            z.latitude,
            z.longitude,
            p.radius,
            p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) AS distance
        FROM zip AS z
            JOIN 
                (
                    /* these are the query parameters */
                    SELECT 
                        42.81 AS latpoint,
                        -70.81 AS longpoint,
                       50.0 AS radius,
                       111.045 AS distance_unit
                ) AS p ON 1 = 1
        WHERE 
            z.latitude BETWEEN p.latpoint - (p.radius / p.distance_unit)
                AND p.latpoint + (p.radius / p.distance_unit)
            AND z.longitude BETWEEN p.longpoint - (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
                AND p.longpoint + (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
    ) AS d
WHERE distance <= radius
ORDER BY distance 
LIMIT 15

Solution

  • I don't believe there is anything you can do here. I imagine that your long running time is coming from the extraordinary amount of CPU necessary to process 4million records through all these calculations.

    Your innermost subquery is just 4 constants that are cross joined to your main subquery, so there isn't anything you can do here to help speed it up. It would be a wash.

    Your main subquery (the big SELECT statement) is doing all the work here and is wrapped in the main query to save processing, since distance would need to be calculated three times, unless mysql's optimizer performed some kind of miracle and recognized that the calculation was used three times.

    At any rate, here is probably a worse performing version that removes the outer most query:

    SELECT 
        z.zip,
        z.primary_city,
        z.latitude,
        z.longitude,    
        p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) AS distance
    FROM zip AS z
        JOIN 
            (
                /* these are the query parameters */
                SELECT 
                    42.81 AS latpoint,
                    -70.81 AS longpoint,
                   50.0 AS radius,
                   111.045 AS distance_unit
            ) AS p ON 1 = 1
    WHERE 
        z.latitude BETWEEN p.latpoint - (p.radius / p.distance_unit)
            AND p.latpoint + (p.radius / p.distance_unit)
        AND z.longitude BETWEEN p.longpoint - (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
            AND p.longpoint + (p.radius / (p.distance_unit * COS(RADIANS(p.latpoint))))
    WHERE p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) <= p.radius
    ORDER BY p.distance_unit * DEGREES(ACOS(COS(RADIANS(p.latpoint)) * COS(RADIANS(z.latitude)) * COS(RADIANS(p.longpoint - z.longitude)) + SIN(RADIANS(p.latpoint)) * SIN(RADIANS(z.latitude)))) LIMIT 15
    

    You can see that all instances of z.distance were replaced with the formula that is used to calculate distance in the WHERE and the ORDER BY parts of the query.

    If you wanted to do away with the Cross Joined subquery that is holding the constants... you could do that too, but now you are losing performance with the last change, and losing readability with the loss of the Cross Join:

    SELECT 
        z.zip,
        z.primary_city,
        z.latitude,
        z.longitude,    
        111.045 * DEGREES(ACOS(COS(RADIANS(42.81)) * COS(RADIANS(z.latitude)) * COS(RADIANS(-70.81 - z.longitude)) + SIN(RADIANS(42.81)) * SIN(RADIANS(z.latitude)))) AS distance
    FROM zip AS z  
    WHERE 
        z.latitude BETWEEN 42.81 - (50.0 / 111.045)
            AND 42.81 + (50.0 / 111.045)
        AND z.longitude BETWEEN -70.81 - (50.0 / (111.045 * COS(RADIANS(42.81))))
            AND -70.81 + (50.0 / (111.045 * COS(RADIANS(42.81))))
    WHERE 111.045 * DEGREES(ACOS(COS(RADIANS(42.81)) * COS(RADIANS(z.latitude)) * COS(RADIANS(-70.81 - z.longitude)) + SIN(RADIANS(42.81)) * SIN(RADIANS(z.latitude)))) <= 50.0
    ORDER BY 111.045 * DEGREES(ACOS(COS(RADIANS(42.81)) * COS(RADIANS(z.latitude)) * COS(RADIANS(-70.81 - z.longitude)) + SIN(RADIANS(42.81)) * SIN(RADIANS(z.latitude)))) LIMIT 15
    

    So... in the end, it's an interesting exercise, but I don't think there are any pros, and there are definitely some cons to these changes.