I'm trying to implement LDA using Spark and got this error. I'm totally new to Spark, so any help is appreciated.
[root@sandbox ~]# spark-submit ./lda.py
Traceback (most recent call last):
  File "/root/./lda.py", line 3, in <module>
    from pyspark.mllib.clustering import LDA, LDAModel
ImportError: cannot import name LDA
Here is the code:
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import numpy
sc = SparkContext(appName="PythonLDA")
data = sc.textFile("/tutorial/input/askreddit20150801.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)
# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))
# Save and load model
ldaModel.save(sc, "myModelPath")
sameModel = LDAModel.load(sc, "myModelPath")
I also tried to install pyspark.mllib.clustering with pip, which failed:
[root@sandbox ~]# pip install spark.mllib.clustering
Collecting spark.mllib.clustering
/usr/lib/python2.6/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Could not find a version that satisfies the requirement spark.mllib.clustering (from versions: )
No matching distribution found for spark.mllib.clustering
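The PySpark wrapper for LDA was introduced in Spark 1.5.0. Assuming your installation isn't corrupted, you are probably running Spark 1.4.x or earlier. Note that pyspark.mllib.clustering is a module that ships with Spark itself rather than a separate package, so pip cannot install it; you need to upgrade Spark to 1.5.0 or later.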
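To confirm which version you are running, `spark-submit --version` prints it, and inside a script the SparkContext exposes it as `sc.version`. Below is a minimal sketch of a version guard you could call before importing LDA. The helper name `supports_python_lda` is hypothetical (it is not part of Spark), and the sketch assumes the version string starts with `major.minor`, as in `1.4.1`.

```python
def supports_python_lda(version):
    """Return True if a Spark version string is 1.5.0 or later.

    Hypothetical helper, not part of the Spark API.
    """
    # Compare only major and minor, so suffixes such as "-SNAPSHOT" are ignored.
    major, minor = [int(part) for part in version.split(".")[:2]]
    return (major, minor) >= (1, 5)


# Usage inside a PySpark script, where sc is the SparkContext:
#     if not supports_python_lda(sc.version):
#         raise SystemExit("Spark %s has no Python LDA wrapper" % sc.version)
print(supports_python_lda("1.4.1"))
print(supports_python_lda("1.5.0"))
```

Failing early with a clear message like this is easier to diagnose than the bare ImportError, especially on a sandbox image where the bundled Spark version is not obvious.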