I want to use the Approximate Nearest Neighbor Search provided by Spark MLlib (ref.), but I'm lost because I couldn't find an example or anything else to guide me. The only information provided at that link is:
Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.
Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
A distance column will be added to the output dataset to show the true distance between each output row and the searched key.
Note: Approximate nearest neighbor search will return fewer than k rows when there are not enough candidates in the hash bucket.
Does anybody know how to use the Approximate Nearest Neighbor Search provided by Spark MLlib?
You can find an example in the Spark documentation, under LSH Algorithms: https://spark.apache.org/docs/2.1.0/ml-features.html#lsh-algorithms
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "keys")
val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "keys")
val key = Vectors.dense(1.0, 0.0)
val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("keys")
  .setOutputCol("values")
val model = brp.fit(dfA)
// Feature Transformation
model.transform(dfA).show()
// Cache the transformed columns
val transformedA = model.transform(dfA).cache()
val transformedB = model.transform(dfB).cache()
// Approximate similarity join
model.approxSimilarityJoin(dfA, dfB, 1.5).show()
model.approxSimilarityJoin(transformedA, transformedB, 1.5).show()
// Self Join
model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < datasetB.id").show()
// Approximate nearest neighbor search
model.approxNearestNeighbors(dfA, key, 2).show()
model.approxNearestNeighbors(transformedA, key, 2).show()
The code above is taken from the Spark documentation. It assumes a spark-shell session, where the `spark` variable is already defined.
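If you only need the nearest neighbor search and want to run it outside spark-shell, a minimal standalone sketch could look like the following. The object name `AnnSketch`, the helper `nearest`, and the column name `distance` are my own choices for illustration and are not part of Spark. The local master setting is an assumption for experimenting on a single machine. It uses the overload of `approxNearestNeighbors` that takes a fourth argument naming the distance column; the three-argument call used above names that column `distCol`.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object AnnSketch {

  // Hypothetical helper: fits an LSH model on a tiny dataset and returns
  // (at most) the k rows closest to a fixed key vector.
  def nearest(spark: SparkSession, k: Int): DataFrame = {
    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 1.0)),
      (1, Vectors.dense(1.0, -1.0)),
      (2, Vectors.dense(-1.0, -1.0)),
      (3, Vectors.dense(-1.0, 1.0))
    )).toDF("id", "keys")

    // Same parameters as the documentation example above.
    val model = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("keys")
      .setOutputCol("values")
      .fit(df)

    // The key is a single feature vector with the same dimension as the data.
    val key = Vectors.dense(1.0, 0.0)

    // The fourth argument names the added distance column.
    // The result may hold fewer than k rows if the hash buckets
    // do not contain enough candidates.
    model
      .approxNearestNeighbors(df, key, k, "distance")
      .select("id", "keys", "distance")
      .orderBy("distance")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation.
    val spark = SparkSession.builder()
      .appName("ann-sketch")
      .master("local[*]")
      .getOrCreate()

    nearest(spark, 2).show()

    spark.stop()
  }
}
```

Passing the untransformed DataFrame is fine here because, as the quoted documentation says, it is transformed automatically. If you query the same dataset many times, transforming and caching it once (as in the documentation example) avoids recomputing the hashes on every call.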