Tags: scala, apache-spark, apache-spark-mllib, pca

How to apply Principal Component Analysis for parquet file?


I have a parquet file which contains id and feature; id is an int and feature is a double. I want to apply the PCA algorithm to reduce the dimensions.

My code reading the parquet file:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val lData = sqlContext.read.parquet("/usr/local/spark/dataset/model/data/user")
val vecData = lData.rdd.map(s => Vectors.dense(s.getInt(0), s.getDouble(1))).cache()
val mat = new RowMatrix(vecData)
val pc = mat.computePrincipalComponents(5)
val projected = mat.multiply(pc)
val projectedRDD = projected.rows
projectedRDD.saveAsTextFile("file:///usr/local/spark/dataset/PCA")

But this error appears:

Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to java.lang.Double

How can I solve it?


Solution

  • Salma,

    Looking at your file, we can see that features is an Array[Double], not a single Double, which is why s.getDouble(1) throws the ClassCastException:

    lData.printSchema
    
    root
     |-- id: integer (nullable = true)
     |-- features: array (nullable = true)
     |    |-- element: double (containsNull = true)
    

    For your code to work, you can change the line that builds vecData to:

    val vecData = lData.rdd.map(s => Vectors.dense(s.getInt(0),s.getAs[Seq[Double]](1):_*)).cache()
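
    For reference, here is a minimal end-to-end sketch with that change in place. It assumes the same input/output paths and the spark.mllib RowMatrix API from your question:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // id (int) and features (array<double>)
    val lData = sqlContext.read.parquet("/usr/local/spark/dataset/model/data/user")

    // Prepend the id, then expand the features array as varargs into a dense vector
    val vecData = lData.rdd
      .map(s => Vectors.dense(s.getInt(0), s.getAs[Seq[Double]](1): _*))
      .cache()

    // Compute the top 5 principal components and project the rows onto them
    val mat = new RowMatrix(vecData)
    val pc = mat.computePrincipalComponents(5)
    val projected = mat.multiply(pc)

    // Save the projected rows as text
    projected.rows.saveAsTextFile("file:///usr/local/spark/dataset/PCA")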
    

    Note: usually I only put the features in Vectors. Depending on what your id represents, you may not need the s.getInt(0) part; see the sketch below.
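
    If the id is only a row identifier rather than an input dimension, a minimal sketch of the vector-building step without it (everything else unchanged) could be:

    // Build vectors from the features column only; the id is not fed into PCA
    val vecData = lData.rdd
      .map(s => Vectors.dense(s.getAs[Seq[Double]](1).toArray))
      .cache()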