apache-spark machine-learning pyspark apache-spark-ml

How does a Spark model treat a vector column?

How does a Spark ML method treat a column produced by VectorAssembler? For example, if I have longitude and latitude columns, is it better to assemble them with VectorAssembler first and then feed the result into my model, or does it make no difference if I pass them in directly (separately)?

Example 1:

loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])

Example 2:

vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])

What is the difference? Which one is better?

Solution

There will be no difference, simply because in both of your examples the final form of the features column is the same: in your 1st example, the loc vector is broken back into its individual components when the second VectorAssembler builds features.
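One rough way to picture this flattening is the pure-Python sketch below. The `assemble` helper is hypothetical: a toy stand-in written for illustration, not Spark's actual implementation, and it assumes vector columns can be modelled as plain Python lists. It only shows why assembling in two steps and assembling in one step end up with the same flat list of numbers.

```python
def assemble(row, input_cols):
    """Toy stand-in for VectorAssembler (hypothetical helper, not Spark code).

    Concatenates the named columns in order; a list-valued column is
    flattened into its individual components.
    """
    out = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            out.extend(float(v) for v in value)  # vector column: unpack it
        else:
            out.append(float(value))             # scalar column: append as-is
    return out


row = {"long": -17.5, "lat": 33.3, "feature1": 10.0, "feature2": 0.2}

# Example 1 style: build loc first, then assemble loc with the other features
row_with_loc = dict(row, loc=assemble(row, ["long", "lat"]))
features_nested = assemble(row_with_loc, ["loc", "feature1", "feature2"])

# Example 2 style: assemble all scalar columns in one step
features_flat = assemble(row, ["long", "lat", "feature1", "feature2"])

print(features_nested == features_flat)
```

The model only ever sees the final flat vector, so under this picture the intermediate loc column carries no extra information.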

Here is a short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):

spark.version
#  u'2.3.1'

# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
                            [1, 40.4, -20.5, 12., 2.2],
                            [2, 28., -23.9, -2., -1.7],
                            [3, 29.5, -19.0, -0.5, -0.2],
                            [4, 32.8, -18.84, 1.5, 1.8]],
                           ["id", "lat", "long", "other", "label"])

from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])

model = pipeline.fit(df)
model.transform(df).show()

The result is:

+---+----+------+-----+-----+-------------+-----------------+
| id| lat|  long|other|label|          loc|         features|
+---+----+------+-----+-----+-------------+-----------------+
|  0|33.3| -17.5| 10.0|  0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
|  1|40.4| -20.5| 12.0|  2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
|  2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
|  3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
|  4|32.8|-18.84|  1.5|  1.8|[-18.84,32.8]|[-18.84,32.8,1.5]|
+---+----+------+-----+-----+-------------+-----------------+

i.e. the features column is identical to the one your 2nd example would produce (not shown here), where you do not use the intermediate assembled feature loc.