Tags: java, apache-spark, apache-spark-mllib, pca, ram

Spark MLlib: PCA on 9570 columns takes too long


1) I am running a PCA on 9570 columns, giving it 12288 MB of RAM in local mode (which means driver only), and it takes from 1.5 up to 2 hours. This is the code (very simple):

System.out.println("level1\n");
// Assemble the 9570 input columns into a single feature vector column
VectorAssembler assemblerexp = new VectorAssembler()
       .setInputCols(metincols)
       .setOutputCol("intensity");
expoutput = assemblerexp.transform(expavgpeaks);

System.out.println("level2\n");
// Fit a PCA model that projects the assembled vectors onto k = 2 components
PCAModel pcaexp = new PCA()
       .setInputCol("intensity")
       .setOutputCol("pcaFeatures")
       .setK(2)
       .fit(expoutput);

System.out.println("level3\n");

So the time it takes to print level3 is what takes so long (1.5 to 2 hours). Is it normal that it takes this long? I have tried different numbers of partitions (2, 4, 6, 8, 50, 500, 10000); for some of them it also takes almost 2 hours, while for others I get a Java heap space error. Also some screenshots from the Spark user interface:

[Screenshots: Executors, Jobs, Stages, Environment]

2) Is it also normal that I get different results from the PCA every time?


Solution

  • If you are setting the driver's RAM programmatically (from within the running application), it does not take effect, because the JVM heap size is fixed at startup; the proper way is to provide it as JVM arguments when the application is launched.
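As a hedged sketch of what "provide JVM arguments" usually means here (option names as documented in the standard Spark configuration docs; adjust the class and jar names to your application):

```
# In local mode everything runs inside the driver JVM, so driver memory is
# what matters. Setting spark.driver.memory programmatically happens after
# that JVM has already started, so it is too late; pass it at launch instead:
spark-submit --driver-memory 12g --class your.main.Class your-app.jar

# or, equivalently, in conf/spark-defaults.conf:
# spark.driver.memory 12g
```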
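As for question 2: one common, benign source of PCA output that differs between runs is that each principal component is an eigenvector of the covariance matrix, and an eigenvector is only defined up to sign, so one run may return v where another returns -v. A minimal sketch (plain Java with a hypothetical 2x2 toy matrix, not Spark code, and not from the original answer) illustrating that both signs are equally valid:

```java
// Demonstrates the sign ambiguity of eigenvectors: if v is an eigenvector
// of A with eigenvalue lambda, then so is -v. PCA libraries may return either.
public class SignFlip {
    public static void main(String[] args) {
        // A small symmetric matrix standing in for a covariance matrix
        double[][] a = {{2.0, 1.0}, {1.0, 2.0}};
        // v = (1/sqrt(2), 1/sqrt(2)) is an eigenvector of a with eigenvalue 3
        double s = 1.0 / Math.sqrt(2.0);
        double[] v = {s, s};
        double[] minusV = {-s, -s};
        System.out.println(isEigenvector(a, v, 3.0));       // true
        System.out.println(isEigenvector(a, minusV, 3.0));  // also true
    }

    // Checks that A * v == lambda * v componentwise, within a tolerance
    static boolean isEigenvector(double[][] a, double[] v, double lambda) {
        for (int i = 0; i < a.length; i++) {
            double av = 0.0;
            for (int j = 0; j < v.length; j++) {
                av += a[i][j] * v[j];
            }
            if (Math.abs(av - lambda * v[i]) > 1e-9) {
                return false;
            }
        }
        return true;
    }
}
```

Since both v and -v pass the eigenvector check, two runs that pick opposite signs are both correct; only the sign (not the subspace) differs.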