algorithm cluster-analysis computational-geometry hamming-distance

Find all k-nearest neighbors

Problem:

I have N (~100m) strings each D (e.g. 100) characters long and with a low alphabet (eg 4 possible characters). I would like to find the k-nearest neighbors for every one of those N points ( k ~ 0.1D). Adjacent strings define via hamming distance. Solution doesn't have to be the best possible but closer the better.

Thoughts about the problem

I have a bad feeling this is a non trivial problem. I have read many papers and algorithms, however most of them has poor result in high dimension and it works when dimension is less than 5. For example this paper suggests an efficient algorithm, but its constant is related to dimension exponentially.

Currently, I am investigating on how can I reduce dimension in a way that hamming distance is preserved or can be computed.

Another option is locality sensitive hashing, Points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Any Help? which option do you prefer?

Solution

One of the previously asked questions has some good discussions, so you can refer to that,

Nearest neighbors in high-dimensional data?

Other than this, you can also look at,

http://web.cs.swarthmore.edu/~adanner/cs97/s08/papers/dahl_wootters.pdf

Few papers which analyze different approaches,

http://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf

https://www.cse.ust.hk/~yike/sigmod09-lsb.pdf