I'm trying to run PCA on a matrix that contains n columns of unlabeled doubles. My code is:
import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("JavaPCAExample")
    .getOrCreate();

// Headerless CSV: Spark auto-names the columns _c0, _c1, ..., _c(n-1)
Dataset<Row> data = spark.read().format("csv")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "false")
    .load("testInput/matrix.csv");

PCAModel pca = new PCA()
    // .setInputCol("features")
    // .setOutputCol("pcaFeatures")
    .setK(3)
    .fit(data);

Dataset<Row> result = pca.transform(data).select("pcaFeatures");
result.show(true);
spark.stop();
Running this throws java.lang.IllegalArgumentException: Field "features" does not exist. I've found these posts:

How to merge multiple feature vectors in DataFrame?
How to work with Java Apache Spark MLlib when DataFrame has columns?

which led me to the VectorAssembler docs here: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

In each of those examples the feature columns are added by listing their headers manually, roughly like this (column names taken from the docs page, not from my data):
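import org.apache.spark.ml.feature.VectorAssembler;

// Docs-style usage: every input column is named explicitly.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"hour", "mobile", "userFeatures"})
    .setOutputCol("features");

My CSV has no headers, so I haven't been able to figure out how to get VectorAssembler to turn all of my n unlabeled columns into features. Any insight would be appreciated. Thanks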
Found the .columns() function. Dataset.columns() returns every column name as a String[] (for a headerless CSV, Spark auto-names them _c0, _c1, ...), which is exactly what VectorAssembler.setInputCols() expects:
import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("JavaPCAExample")
    .getOrCreate();

Dataset<Row> data = spark.read().format("csv")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "false")
    .load("testInput/matrix.csv");

// data.columns() returns all column names (_c0 ... _c(n-1) here),
// so the assembler picks up every column without naming them by hand.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(data.columns())
    .setOutputCol("features");
Dataset<Row> output = assembler.transform(data);

PCAModel pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(5)
    .fit(output);
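From there the rest is the same as in the question, except the transform has to run on the assembled dataset rather than the raw one. Something like this should finish it off (an untested sketch of the remaining steps):

Dataset<Row> result = pca.transform(output).select("pcaFeatures");
result.show(false); // false = don't truncate the printed vectors
spark.stop();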