I've created a Pipeline for doing LDA in Spark 2.0 (via the PySpark API):
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer, RegexTokenizer

    def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern=r'[\W]+'):
        """
        Create a pipeline for running an LDA model on a corpus. This function does not need data and will not actually do
        any fitting until invoked by the caller.
        minDF: minimum number of documents in the corpus that a word must appear in
        minTF: minimum number of times a word must appear in a document
        pattern: regular expression used to split the text into words
        Returns pipeline: an unfitted pyspark.ml.Pipeline (fitting it yields a PipelineModel)
        """
        reTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern=pattern, minTokenLength=minTokenLength)
        cntVec = CountVectorizer(inputCol=reTokenizer.getOutputCol(), outputCol="vectors", minDF=minDF, minTF=minTF)
        lda = LDA(k=numTopics, seed=seed, optimizer="em", featuresCol=cntVec.getOutputCol())
        pipeline = Pipeline(stages=[reTokenizer, cntVec, lda])
        return pipeline
I want to calculate the perplexity on a dataset using the trained model with the LDAModel.logPerplexity() method, so I tried running the following:
    training = get_20_newsgroups_data(test_or_train='train')
    pipeline = create_lda_pipeline(numTopics=20, minDF=3, minTokenLength=5)
    model = pipeline.fit(training)  # train model on training data
    testing = get_20_newsgroups_data(test_or_train='test')
    perplexity = model.logPerplexity(testing)
This just results in the following AttributeError:

    'PipelineModel' object has no attribute 'logPerplexity'
I understand why this error happens, since the logPerplexity method belongs to LDAModel, not PipelineModel, but I am wondering if there is a way to access the method from that stage.
All fitted transformers in a PipelineModel are stored in its stages property. Take the last entry of model.stages, which here is the fitted LDA model, and you can call logPerplexity on it directly.
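A minimal sketch of that lookup, assuming the last stage of the fitted pipeline is the LDA model and the earlier stages produce the column it reads as features. The helper name log_perplexity_from_pipeline is hypothetical; stages, transform, and logPerplexity are the PySpark calls it relies on. The dataset is passed through the feature stages first, because the LDA stage expects the "vectors" column rather than raw text.

```python
def log_perplexity_from_pipeline(pipeline_model, dataset):
    """Hypothetical helper: log-perplexity bound of `dataset` under the LDA
    stage of a fitted PipelineModel.

    Assumes the LDA model is the last stage and every earlier stage is a
    feature transformer (as in create_lda_pipeline from the question).
    """
    stages = pipeline_model.stages
    lda_model = stages[-1]  # fitted LDA model produced by pipeline.fit()

    # Tokenize and vectorize the data the same way as during training,
    # so that the column the LDA stage reads its features from exists.
    features = dataset
    for stage in stages[:-1]:
        features = stage.transform(features)

    return lda_model.logPerplexity(features)
```

With the variables from the question, this would be called as log_perplexity_from_pipeline(model, testing). Because the pipeline uses optimizer="em", the fitted stage should be a distributed LDA model, and the PySpark documentation warns that logPerplexity then collects a large topics matrix to the driver.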