algorithm haskell functional-programming hugs closest-points

Haskell to find the distance between the two closest points

Given a list of points in a two dimensional space, you want to perform a function in Haskell to find the distance between the two closest points. example: Input: project [(1,5), (3,4), (2,8), (-1,2), (-8.6), (7.0), (1.5), (5.5), (4.8), (7.4)] Output: 2.0

Assume that the distance between the two farthest points in the list is at most 10000.

Here´s my code:

import Data.List
import System.Random

sort_ :: Ord a => [a] -> [a]
sort_ []    =  []
sort_ [x]  =  [x]
sort_ xs   =  merge (sort_ left) (sort_ right)
  where
    (left, right) = splitAt (length xs `div` 2) xs
    merge [] xs = xs
    merge xs [] = xs
    merge (x:xs) (y:ys)=
    if x <= y then 
        x : merge xs (y:ys)
    else  y : merge (x:xs) ys     

project :: [(Float,Float)] -> Float
project [] = 0
project (x:xs)=
    if null (xs) then 
        error "The list have only 1 point"
    else head(sort_(dstList(x:xs)))

distance :: (Float,Float)->(Float,Float) -> Float
distance (x1,y1) (x2,y2) = sqrt((x1 - x2)^2 + (y1 - y2)^2)


dstList :: [(Float,Float)] -> [Float]
dstList (x:xs)=
    if length xs == 1 then 
        (dstBetween x xs):[]
    else (dstBetween x xs):(dstList xs)


dstBetween :: (Float,Float) -> [(Float,Float)] -> Float
dstBetween pnt (x:xs)=
    if null (xs) then 
        distance pnt x
    else  minimum ((distance pnt ):((dstBetween pnt xs)):[])

{-
Calling generator to create a file created at random points
-}
generator = do
    putStrLn "Enter File Name"
    file <- getLine
    g <- newStdGen
    let pts = take 1000 . unfoldr (Just . (\([a,b],c)->((a,b),c)) . splitAt 2) 
                $ randomRs(-1,1) g :: [(Float,Float)]
    writeFile file . show $ pts

{-
Call the main to read a file and pass it to the function of project
The function of the project should keep the name 'project' as described 
in the statement
-}
main= do
    putStrLn "Enter filename to read"
    name <- getLine
    file <- readFile name
    putStrLn . show . project $ readA file

readA::String->[(Float,Float)]
readA = read

I can perform a run of the program as in the example or using the generator as follows:

in haskell interpreter must type "generator", the program will ask for a file name containing a thousand points here. and after the file is generated in the Haskell interpreter must write main, and request a file name, which is the name of the file you create with "generator".

The problem is that for 1000 points randomly generated my program takes a long time, about 3 minutes on a computer with dual core processor. What am I doing wrong? How I can optimize my code to work faster?

Solution

You are using a quadratic algorithm:

project []  = error "Empty list of points"
project [_] = error "Single point is given"
project ps  = go 10000 ps
  where
    go a [_]    = a
    go a (p:ps) = let a2 = min a $ minimum [distance p q | q<-ps]
                  in a2 `seq` go a2 ps

You should use a better algorithm. Search computational-geometry tag on SO for a better algorithm.

@maxtaldykin proposes a nice, simple and effective change to the algorithm, which should make a real difference for random data -- pre-sort the points by X coordinate, and never try points more than d units away from the current point, in X coordinate (where d is the currently known minimal distance):

import Data.Ord (comparing)
import Data.List (sortBy)

project2 ps@(_:_:_) = go 10000 p1 t 
  where
    (p1:t) = sortBy (comparing fst) ps
    go d _         [] = d
    go d p1@(x1,_) t  = g2 d t
      where
        g2 d []          = go d (head t) (tail t)
        g2 d (p2@(x2,_):r)
           | x2-x1 >= d  = go d (head t) (tail t)
           | d2 >= d     = g2 d  r
           | otherwise   = g2 d2 r   -- change it "mid-flight"
               where
                 d2 = distance p1 p2

On random data, g2 will work in O(1) time so that go will take O(n) and the whole thing will be bounded by a sort, ~ n log n.

Empirical orders of growth show ~ n^2.1 for the first code (on 1k/2k range) and ~n^1.1 for the second, on 10k/20k range, testing it quick'n'dirty compiled-loaded into GHCi (with second code running 50 times faster than first for 2,000 points, and 80 times faster for 3,000 points).