python algorithm vector bioinformatics cartesian

Algorithm for placing a grid over a disordered set of points

Given a large set (tens of thousands up to millions) of disordered points represented as 3D Cartesian vectors, what's a good algorithm for making a regular square grid (of user-defined spacing) that encloses all of the points? Some constraints:

The grid needs to be square and regular
I need to be able to adjust the grid spacing (the length of a side of one of the squares), ideally with a single variable
I want a grid of minimum size, ie every 'block' in the grid should contain at least one of the disordered points, and every disordered point should be enclosed in a 'block'
The return value of the algorithm should be the list of coordinates of the grid points

To illustrate in 2D, given this set of points:

for some grid spacing X, one possible return value of the algorithm would be the coordinates of these red points (dashed lines for illustration purposes only):

and for grid spacing X/2, one possible return value of the algorithm would be the coordinates of these red points (dashed lines for illustration purposes only):

For anyone who's interested, the disordered points that I'm working with are the atomic coordinates of large protein molecules, like what you can get out of a .pdb file.

Python is preferred for solutions, although pseudocode is also good.

EDIT: I think that my first description of what I needed was maybe a little fuzzy, so I added some constraints and images in order to clarify things.

Solution

Because you are asking for a regular square grid of user-specified spacing, it sounds like a reasonably straightforward approach should work.

Start by passing through the data to work out the minimum and maximum co-ordinate in each dimension. Work out the number of steps of user-specified spacing required to cover the distance between maximum and minimum.

Pass through the data again to allocate each point to a cell in the grid, using a grid with a point at the minimum of each co-ordinate and the specified spacing (e.g. X_cell = Math.floor((x_i - x_min) / spacing)). Use a dictionary or an array to record the number of points in each cell.

Now print out the co-ordinates of the cells with at least one point in them.

You do have some freedom that I have not attempted to optimise: unless the distance between minimum and maximum co-ordinate is an exact multiple of the grid spacing, there will be some slop that allows you to slide the grid around and still have it contain all the points: at the moment the grid starts at the position of the lowest point, but it probably ends before the highest points, so you have room to move it down a little in each dimension. As you do this, some points will move from cell to cell, and the number of occupied cells will change.

If you consider only moves in one dimension at a time, you can work out what will happen reasonably efficiently. Work out the distance in that dimension between each point and the maximum co-ordinate in that dimension of its cell, and then sort these values. As you move the grid down, the point with the smallest distance to its maximum co-ordinate will swap cells first, and you can iterate through these points one by one by moving through them in sorted order. If you update the counts of points in cells as you do this you can work out which shift minimises the number of occupied cells.

Of course, you have three dimensions to worry about. You could work on them one at a time until you getting reductions in the number of cells. This is a local minimum, but may not be a global minimum. One way to look for other local minima is to start again from a randomly chosen starting point.