Tags: apache-spark, machine-learning, regression, apache-spark-mllib, vowpalwabbit

Spark MLLib: convert arbitrary, sparse features to a fixed length Vector


We are converting an online linear regression model from Vowpal Wabbit to Spark MLLib. Vowpal Wabbit allows arbitrary, sparse features by backing the model's weights with a linked list, whereas Spark MLLib trains on an MLLib Vector of weights, which is backed by a fixed-length array.

The features we pass to the model are arbitrary strings, not categories. Vowpal Wabbit maps each feature string to an index via a hash and assigns it a weight value of 1.0. We can do the same mapping in MLLib, but we are limited to a fixed-length array. Is it possible to train such a model in MLLib when the size of the feature space is not known up front?


Solution

  • FeatureHasher will do this, and it uses the same hash function as Vowpal Wabbit (MurmurHash3). Vowpal Wabbit and FeatureHasher both default to 2^18 features.

    https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/ml/feature/FeatureHasher.html
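To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick that FeatureHasher and Vowpal Wabbit apply: each arbitrary string feature is hashed to an index in a fixed-size vector, so the feature space never has to be enumerated in advance. This is an illustration only, not Spark code, and it uses Python's `hashlib` rather than MurmurHash3, so the actual indices will differ from what Spark or Vowpal Wabbit would produce.

```python
# Sketch of the hashing trick: arbitrary string features -> fixed-length
# sparse vector. Illustrative only; FeatureHasher uses MurmurHash3,
# while this demo uses hashlib's MD5 for portability.
import hashlib

NUM_FEATURES = 2 ** 18  # default vector size in both FeatureHasher and Vowpal Wabbit

def hash_features(features):
    """Map arbitrary string features to a sparse {index: value} vector."""
    vec = {}
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % NUM_FEATURES
        # Each occurrence of a feature contributes 1.0; hash collisions
        # simply accumulate into the same index.
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

sparse = hash_features(["user=alice", "country=US", "query=spark mllib"])
print(sparse)  # e.g. three indices in [0, 2^18), each with value 1.0
```

In Spark itself, the equivalent step is `org.apache.spark.ml.feature.FeatureHasher` with `setInputCols`/`setOutputCol` (and optionally `setNumFeatures`), which produces a sparse MLLib Vector column that can feed directly into a regression estimator.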