I am working on duplicate documents detection problem using LSH algorithm. To handle large-scale data, we are using spark.
I have around 300K documents with at least 100-200 words per document. On spark cluster, these are the steps we are performing on data frame.
pipeline = Pipeline().setStages([
docAssembler,
tokenizer,
normalizer,
stemmer,
finisher,
stopwordsRemover,
# emptyRowsRemover
])
model = pipeline.fit(spark_df)
final_df = model.transform(spark_df)
final_df_limit.rdd.map(lambda x: (CalculateMinHash(x),)).toDF()
2nd step is failing as spark is not allowing us to store custom type value as a column. Value is an object of class MinHash.
Does anyone know how can i store Minhash objects in dataframes?
I don't think it might be possible to save python objects in DataFrames, but you can circumvent this in a couple of ways:
If that is not feasible because you still need some properties of the object, you might want to serialize it using Pickle, saving the serialized result as an encoded string. This forces you to de-serialize every time that you want to use the object.
final_df_limit.rdd.map(lambda x: base64.encodestring(pickle.dumps(CalculateMinHash(x),))).toDF()
An alternative might be to use the Spark MinHash implementation instead, but that might not suit all your requirements.