I am writing an algorithm on spatial data processing using an R-tree for a university project, where we are tasked to write a KNN algorithm as a range query using range estimation. We were not confined to a single programming language, so I chose C#.
I had set up the R-tree, the Point class and the base of the KNN algorithm, but I could not figure out how to implement an efficient range estimation function, so I searched online to find existing implementations. I came across a paper ("Efficient k Nearest Neighbor Queries on Remote Spatial Databases Using Range Estimation"), which proposed a density-based solution, where the range is estimated based on the density of the MBB (minimum bounding box) of the whole dataset.
Density = number of points in MBB / area of MBB
Here is part of my class where I've implemented the functions written in the paper:
public class KNNRangeQueryRTreeDensity {
... (constructor and pivate fields omitted)
private double GetDensityOfDatasetByGeometry(Geometry geom)
{
int numOfObjs = geom.NumPoints;
Envelope env = geom.EnvelopeInternal;
double area = (env.MaxX - env.MinX) * (env.MaxY - env.MinY);
return numOfObjs / area;
}
public double EstimateRangeWithDensity(int k, Geometry databaseGeometry)
{
return Math.Sqrt(k / (Math.PI * GetDensityOfDatasetByGeometry(databaseGeometry)));
}
public double ExpandRange(int k, Point queryPoint, Envelope prevEnv, int numOfNNRetrieved)
{
double range, envDensity;
if (numOfNNRetrieved == 0)
{
range = 2 * ((prevEnv.MaxX - prevEnv.MinX) / 2);
}
else
{
envDensity = numOfNNRetrieved / ((prevEnv.MaxX - prevEnv.MinX) * (prevEnv.MaxY - prevEnv.MinY));
range = Math.Sqrt(k / (Math.PI * envDensity));
}
return range;
}
public (List<(Point, double)>, double) KNNQuery(int k, Point queryPoint)
{
double dist;
double range = EstimateRangeWithDensity(k, _pointsGeometry); //perform range estimation
Envelope window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range); //create a new window (envelope) using the query point and range
IList<Point> resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
List<(Point, double)> nnList = new List<(Point, double)>(k); //create list with a capacity of k
int count = 0; //this keeps track of how many points are actually added to nnList
foreach (Point point in resultsList) //perform initial query based on the range estimation above
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
nnList.Add((point, dist)); //add a tuple consisting of the point and its distance to the query point to the list
count++;
}
}
/* the above range estimate is not loose and because of that it might fail (i.e. may return less than k)
* when that happens, we need to further refine the range estimate. */
while (count < k)
{
range = ExpandRange(k, queryPoint, window, count); //expand range
window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range);
//repeat above code block
resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
count = 0;
foreach (Point point in resultsList) //perform initial query based on the range estimation above
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
if (!nnList.Contains((point, dist))) //check if point already exists in list
{
nnList.Add((point, dist));
}
count++;
}
}
}
return (nnList, range);
}
}
}
(Note: for the R-tree
, Geometry
, Envelope
and Coordinate
objects I'm using the .NET Topology Suite. The Point
class is mine.)
Point
class:
public class Point
{
public double X { get; set; }
public double Y { get; set; }
public Point(double x, double y)
{
X = x;
Y = y;
}
public double DistanceTo(Point o)
{
return Math.Sqrt(Math.Pow(X - o.X, 2) + Math.Pow(Y - o.Y, 2));
}
public Coordinate ToCoordinate()
{
return new Coordinate(X, Y);
}
}
After writing it I went ahead to test it with a dataset of 1024 points whose XY coordinates are randomly distributed within [0.0, 10.0)
, while setting k = 12
and a query point (0.5, 0.6)
:
ID X Y
1|1,629170883274251|5,007505367979177
2|0,7293811350732022|6,2009057338353735
3|4,087100198532967|8,718630060888188
4|6,081353833936785|0,508181154964576
... (up to point 1024)
Everything seems to run normally, however instead of returning 12 points, it returns 13.
Console output:
Enter filename with full path: M:\Visual Studio\DPTProj\rangedataset.txt
How many neighbors do you want to retrieve (k = ?)?
Type here: 12
Building R-tree...
R-tree built successfully.
Estimated range for k = 12: 0,8435869948686181
Points within radius 0,8435869948686181 of (0,5, 0,6):
1: (0,6843068314177482, 0,46912474579602703) | Distance to query point: 0,22604720805664638
2: (0,4947431154990304, 0,2580122092077565) | Distance to query point: 0,34202819165328435
3: (0,09012448605621443, 0,4170682841991392) | Distance to query point: 0,4488451287209534
4: (0,8195993866862726, 0,259394385041387) | Distance to query point: 0,46707167855863035
5: (0,28279345495756414, 0,1435370604244699) | Distance to query point: 0,5055067738568947
6: (0,10963706770429252, 0,2626442072273438) | Distance to query point: 0,5159381259683864
7: (0,12078106408975137, 0,17282920897604392) | Distance to query point: 0,5712108945537835
8: (1,0621745516835592, 0,7497310129691526) | Distance to query point: 0,5817728103008762
9: (0,6616472921621275, 1,2313257349800906) | Distance to query point: 0,6516916684379966
10: (1,1860976280579798, 0,6747156943542489) | Distance to query point: 0,6901538887883075
11: (0,06693545266377528, 0,04315583037359446) | Distance to query point: 0,705422094498358
12: (0,7070193349882119, 1,280761478133854) | Distance to query point: 0,7115428273617488
13: (0,19052830067953483, 1,2878140673450258) | Distance to query point: 0,754228694706058
Returned 13 points - range query executed with one or more errors
(The "with one or more errors" line above is displayed whenever the number of returned points is different from k
.)
This also happens with different selections for k
(e.g. k = 128
), where it returns 147 points.
As I was running it through the debugger it seems that the additional points are added in the foreach
loop inside the while
loop, with the latter being executed if the first range query retrieves less than k
points. Therefore, as the range gets expanded more points are added to resultsList
, which are unwanted:
while (count < k)
{
range = ExpandRange(k, queryPoint, window, count); //expand range
window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range);
//repeat above code block
resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
count = 0;
foreach (Point point in resultsList) //<-- resultsList count is larger than k, so this runs for more times than it should
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
if (!nnList.Contains((point, dist))) //check if point already exists in list
{
nnList.Add((point, dist));
}
count++;
}
}
}
After running some tests it seems that the range does get expanded correctly by the ExpandRange()
function, based on the formula in the paper, so the range is not the problem.
I think the problem is that the foreach
loop does not stop until all points in resultsList
are checked, which leads to more than k
points being added since the while
loop's condition won't be checked until after it finishes. A solution I came up with is to just break the loop once k
points are added to nnList
, effectively "truncating" the results, but I do not think it's that simple.
Is there any way to stop it from returning more than k
points?
A solution I came up with is to just break the loop once k points are added to nnList, effectively "truncating" the results, but I do not think it's that simple.
Yes, you can't do that because the extra points might be closer than existing points in the list.
A simple way to limit the count is just to sort and truncate the list after the foreach loop when you exceed k:
foreach (Point point in resultsList) //<-- resultsList count is larger than k, so this runs for more times than it should
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
if (!nnList.Contains((point, dist))) //check if point already exists in list
{
nnList.Add((point, dist));
}
count++;
}
}
// Add this code to take the first k points ordered by distance:
if(count > k)
nnList = nnList.OrderBy(p => p.Item2).Take(k).ToList();