Tags: apache-spark-mllib, lda

Understanding Spark MLlib LDA input format


I am trying to implement LDA using Spark MLlib.

But I am having difficulty understanding the input format. I was able to run the sample implementation, which takes its input from a file containing only numbers, as shown:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

I followed http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
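
For reference, the sample code from that page looks roughly like this (it loads a numbers file like the one above and indexes each row as a document):

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data: one document per line
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)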

I understand the output format of this, as explained there.

My use case is very simple: I have one data file with some sentences, and I want to convert this file into a corpus so that I can pass it to org.apache.spark.mllib.clustering.LDA.run().

My question is about what the numbers in the input represent, given that the data is zipWithIndex'd and passed to LDA. Does the number 1 represent the same word wherever it appears, or is it some kind of count?


Solution

  • First, you need to convert your sentences into vectors, for example with TF-IDF:

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    
    // Split each line into its words
    val documents: RDD[Seq[String]] = sc.textFile("yourfile").map(_.split(" ").toSeq)
    
    // Term frequencies via feature hashing, then IDF weighting
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents).cache()
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)
    
    // Pair each document vector with a unique document ID
    val corpus = tfidf.zipWithIndex.map(_.swap).cache()
    
    // Cluster the documents into three topics using LDA
    val ldaModel = new LDA().setK(3).run(corpus)
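
    Once the model is trained, you can inspect the discovered topics. A minimal sketch (note that HashingTF maps words to hashed indices, so the term indices printed here cannot be mapped straight back to words):

    // Print the top 5 terms of each topic as (termIndex, weight) pairs
    val topics = ldaModel.describeTopics(maxTermsPerTopic = 5)
    topics.zipWithIndex.foreach { case ((terms, weights), topic) =>
      println(s"Topic $topic: " + terms.zip(weights).mkString(", "))
    }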
    

    Read more about TF-IDF vectorization in the Spark MLlib feature extraction documentation.
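
    If you need to control the dimensionality of the term-frequency vectors, note that HashingTF takes the number of features as a constructor argument (the default is 2^20); for example:

    // Hash words into a 1,000-dimensional space instead of the default 2^20
    val hashingTF = new HashingTF(numFeatures = 1000)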