
How to split columns into label and features in pyspark?


I am studying PySpark. The ML Pipelines guide at https://spark.apache.org/docs/2.2.0/ml-pipeline.html contains this example:

from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")
......

As you can see, this is a very small dataset, and all the features are assembled into a single column with a common name: features.

But usually we read data from csv file like this:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/home/feng/Downloads/datatry.csv", header=True)

Suppose my data has 5 columns: c1, c2, c3, c4, c5, where c5 is the label column and the other 4 are feature columns. How do I transform the CSV data into the format above so that I can keep working? Or is there another approach that does not require this?

Thanks


Solution

  • VectorAssembler can be used to transform a given list of columns to a single vector column.

    Example usage:

    from pyspark.ml.feature import VectorAssembler

    # Combine the four feature columns into a single vector column
    assembler = VectorAssembler(
        inputCols=["c1", "c2", "c3", "c4"],
        outputCol="features")

    output = assembler.transform(df)
    
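    Applied to the question's CSV, two more steps are needed: the feature columns must be numeric, and c5 must be renamed. A minimal sketch (note that spark.read.csv reads every column as a string unless inferSchema=True is set):

    # Re-read the file so numeric columns are parsed as numbers, not strings
    df = spark.read.csv("/home/feng/Downloads/datatry.csv",
                        header=True, inferSchema=True)

    # Rename the label column and keep only the (label, features) pair
    data = (assembler.transform(df)
            .withColumnRenamed("c5", "label")
            .select("label", "features"))
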

    This requires all the columns used to be of numeric, boolean, or vector type. If you have string columns, it is necessary to use an additional transformer: StringIndexer. For an overview of all available transformers, see the documentation.
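
    For example, if c4 held string categories (an assumption made here for illustration), a minimal sketch:

    from pyspark.ml.feature import StringIndexer

    # Maps each distinct string in c4 to a numeric index (0.0, 1.0, ...)
    indexer = StringIndexer(inputCol="c4", outputCol="c4_indexed")
    indexed = indexer.fit(df).transform(df)

    The indexed column (c4_indexed, a name chosen for illustration) can then be listed in inputCols in place of the raw string column.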

    Note that when using multiple transformers consecutively on the same data, it's simpler to use a Pipeline.
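
    As a minimal sketch, the indexer and assembler above can be chained with the question's LogisticRegression (this assumes inputCols lists c4_indexed instead of the raw c4):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(maxIter=10, regParam=0.01)

    assembler = VectorAssembler(
        inputCols=["c1", "c2", "c3", "c4_indexed"],
        outputCol="features")

    # Stages run in order: index c4, assemble the vector, then fit the model.
    # LogisticRegression expects "label" and "features" columns by default,
    # hence renaming c5 before fitting.
    pipeline = Pipeline(stages=[indexer, assembler, lr])
    model = pipeline.fit(df.withColumnRenamed("c5", "label"))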