Tags: dataframe, pyspark, lsh

Is it possible to store a custom class object in a Spark DataFrame as a column value?


I am working on a duplicate document detection problem using the LSH algorithm. To handle large-scale data, we are using Spark.

I have around 300K documents with at least 100-200 words per document. On the Spark cluster, these are the steps we perform on the data frame.

  1. Run a Spark ML pipeline to convert the text into tokens.

from pyspark.ml import Pipeline

# The individual stages (docAssembler, tokenizer, etc.) are defined earlier.
pipeline = Pipeline().setStages([
    docAssembler,
    tokenizer,
    normalizer,
    stemmer,
    finisher,
    stopwordsRemover,
    # emptyRowsRemover
])
model = pipeline.fit(spark_df)
final_df = model.transform(spark_df)

  2. For each document, compute a MinHash value using the datasketch library (https://github.com/ekzhu/datasketch/) and store it as a new column.
final_df_limit.rdd.map(lambda x: (CalculateMinHash(x),)).toDF()
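
For reference, CalculateMinHash builds the datasketch object from a document's tokens; a simplified sketch of what it does (the token column name is an assumption):

from datasketch import MinHash

def CalculateMinHash(row, num_perm=128):
    # Build a MinHash signature from the document's token list.
    # "finished_clean_tokens" stands for the tokens column produced by the
    # pipeline above; the exact name is illustrative.
    m = MinHash(num_perm=num_perm)
    for token in row["finished_clean_tokens"]:
        m.update(token.encode("utf8"))
    return m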

The 2nd step fails because Spark does not allow us to store a custom-type value as a column. The value is an object of the MinHash class.

Does anyone know how I can store MinHash objects in a DataFrame?


Solution

  • I don't think it is possible to store arbitrary Python objects in DataFrames, but you can work around this in a couple of ways:

    • Store the result instead of the object (I'm not sure exactly how MinHash works internally, but if the underlying value is numeric or a string, it should be easy to extract it from the class object); see the first sketch after this list.
    • If that is not feasible because you still need some properties of the object, you can serialize it with pickle and save the serialized result as a base64-encoded string. This forces you to deserialize it every time you want to use the object:

      import base64, pickle
      final_df_limit.rdd.map(lambda x: (base64.b64encode(pickle.dumps(CalculateMinHash(x))).decode("ascii"),)).toDF()

    • An alternative might be to use Spark's own MinHash implementation (pyspark.ml.feature.MinHashLSH) instead; a second sketch below shows the idea, but it might not suit all your requirements.
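
For the first option, here is a minimal sketch assuming you keep using datasketch and that its hashvalues array (a numpy array of integers) is the result you want to store; the output column name is illustrative:

# Keep only the numeric signature (a list of ints), which Spark can store
# as an array<bigint> column, instead of the MinHash object itself.
signatures_df = final_df_limit.rdd \
    .map(lambda x: ([int(v) for v in CalculateMinHash(x).hashvalues],)) \
    .toDF(["minhash_signature"])

If you later need the object again, datasketch can reconstruct a MinHash from stored hash values (e.g. MinHash(hashvalues=...)), so nothing is lost by storing only the signature.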
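
For the last option, a rough sketch of Spark's built-in MinHashLSH from pyspark.ml.feature. It works on sparse count vectors rather than raw tokens, so the tokens column produced by your pipeline (called "tokens" here, which is an assumption) has to be vectorized first:

from pyspark.ml.feature import CountVectorizer, MinHashLSH

# Turn the token arrays into sparse count vectors; MinHashLSH requires
# vectors with at least one non-zero entry.
cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(final_df)
vectorized = cv_model.transform(final_df)

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
mh_model = mh.fit(vectorized)
hashed_df = mh_model.transform(vectorized)  # adds a "hashes" column

# Approximate self-join to find near-duplicate pairs; 0.6 is an arbitrary
# Jaccard-distance threshold.
pairs = mh_model.approxSimilarityJoin(vectorized, vectorized, 0.6,
                                      distCol="jaccard_distance")

Here the MinHash signatures live in a regular DataFrame column ("hashes"), so the original problem of storing a custom object never arises.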