1) I am running a PCA on 9570 columns with 12288 MB of RAM in local mode (which means driver only), and it takes between 1.5 and 2 hours. This is the code (very simple):
System.out.println("level1\n");
VectorAssembler assemblerexp = new VectorAssembler()
    .setInputCols(metincols)
    .setOutputCol("intensity");
expoutput = assemblerexp.transform(expavgpeaks);
System.out.println("level2\n");
PCAModel pcaexp = new PCA()
    .setInputCol("intensity")
    .setOutputCol("pcaFeatures")
    .setK(2)
    .fit(expoutput);
System.out.println("level3\n");
The slow part is the time it takes to reach the level3 print (1.5 to 2 hours). Is it normal for it to take this long? I have tried different numbers of partitions (2, 4, 6, 8, 50, 500, 10000); some of them also take almost 2 hours, while others fail with a Java heap space error. Also, some pictures from the Spark user interface:
2) Is it also normal that I get different results with the PCA every time?
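If you are setting the RAM programmatically (for example through SparkConf inside the application), it does not take effect in local mode, because the driver JVM has already started by the time that setting is read. The proper way is to provide it as a JVM argument (for example -Xmx12g), or through the --driver-memory option of spark-submit.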
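
One way to check whether the memory setting actually reached the driver is to print the JVM's maximum heap at startup. The following is only a minimal sketch: the class name HeapCheck and the helper methods are made up for illustration, and the 12288 MB figure is simply the value from the question. Note that Runtime.maxMemory() can report somewhat less than the -Xmx value, so the comparison allows some slack.

```java
public class HeapCheck {

    /** Maximum heap the running JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if the JVM's maximum heap is at least requiredMb megabytes. */
    static boolean hasAtLeast(long requiredMb) {
        return maxHeapMb() >= requiredMb;
    }

    public static void main(String[] args) {
        // Value taken from the question; adjust to whatever was requested.
        long requestedMb = 12288;

        System.out.println("Max heap (MB): " + maxHeapMb());

        // Allow about 10% slack, since the reported maximum can be
        // slightly below the -Xmx value depending on the garbage collector.
        if (!hasAtLeast(requestedMb * 9 / 10)) {
            System.out.println("Heap is well below the requested size; "
                + "the memory setting probably did not take effect.");
        }
    }
}
```

If this is run with, say, java -Xmx12g HeapCheck, the printed value should be close to 12288. If it prints a much smaller number when called at the start of the Spark application, the setting was most likely applied too late.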