
How to split Vector into columns - using PySpark


Context: I have a DataFrame with two columns, word and vector, where the type of the "vector" column is VectorUDT.

An Example:

word    |  vector
assert  | [435,323,324,212...]

And I want to get this:

word   | v1  | v2  | v3  | v4  | v5 | v6 ...
assert | 435 | 323 | 324 | 212 | ...

Question:

How can I split a column of vectors into several columns, one per dimension, using PySpark?

Thanks in advance


Solution

  • Spark >= 3.0.0

    Since Spark 3.0.0 this can be done without using a UDF:

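    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # df: the example DataFrame built in the Spark < 3.0.0 section below
    (df
        .withColumn("xs", vector_to_array("vector"))
        .select(["word"] + [col("xs")[i] for i in range(3)]))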
    
    ## +-------+-----+-----+-----+
    ## |   word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+
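
The snippet above hard-codes three dimensions and leaves the columns named xs[0], xs[1], and so on. Below is a minimal sketch of one way to get the v1, v2, ... names from the question without hard-coding the size. The helper names `vector_column_names` and `split_vector` are hypothetical (not part of PySpark), and the sketch assumes every row holds a vector of the same length as the first row.

```python
def vector_column_names(n, prefix="v"):
    """Return ['v1', ..., 'vn'] for an n-dimensional vector."""
    return [f"{prefix}{i + 1}" for i in range(n)]


def split_vector(df, vector_col="vector", prefix="v"):
    """Expand a VectorUDT column into one double column per dimension."""
    # Spark imports are local so the naming helper above has no Spark dependency.
    from pyspark.ml.functions import vector_to_array  # requires Spark >= 3.0.0
    from pyspark.sql.functions import col

    # Assumption: all vectors have the same length as the one in the first row.
    n = len(df.first()[vector_col])
    xs = vector_to_array(col(vector_col))

    # Keep every non-vector column, then append one aliased column per dimension.
    others = [col(c) for c in df.columns if c != vector_col]
    values = [xs[i].alias(name)
              for i, name in enumerate(vector_column_names(n, prefix))]
    return df.select(others + values)
```

With the example DataFrame, `split_vector(df)` should yield the columns word, v1, v2, v3. Reading the size from the first row costs one extra Spark job, so if the dimension is already known it may be simpler to pass it in directly.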
    

  • Spark < 3.0.0

    One possible approach is to convert to and from RDD:

    from pyspark.ml.linalg import Vectors
    
    df = sc.parallelize([
        ("assert", Vectors.dense([1, 2, 3])),
        ("require", Vectors.sparse(3, {1: 2}))
    ]).toDF(["word", "vector"])
    
    def extract(row):
        return (row.word, ) + tuple(row.vector.toArray().tolist())
    
    df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...
    
    ## +-------+---+---+---+
    ## |   word| _2| _3| _4|
    ## +-------+---+---+---+
    ## | assert|1.0|2.0|3.0|
    ## |require|0.0|2.0|0.0|
    ## +-------+---+---+---+
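
The generated names _2, _3, ... can be replaced by passing a full list of names to toDF. The following is a sketch only: `flat_column_names` and `split_with_rdd` are hypothetical helpers, and the vector length `n` is assumed to be known in advance.

```python
def extract(row):
    # Flatten (word, vector) into a plain tuple of Python values;
    # toArray() also expands sparse vectors to their dense form.
    return (row.word,) + tuple(row.vector.toArray().tolist())


def flat_column_names(n, first="word", prefix="v"):
    """Return ['word', 'v1', ..., 'vn']."""
    return [first] + [f"{prefix}{i + 1}" for i in range(n)]


def split_with_rdd(df, n):
    # Assumption: df has exactly the columns `word` and `vector`,
    # and every vector has length n.
    return df.rdd.map(extract).toDF(flat_column_names(n))
```

For the example DataFrame, `split_with_rdd(df, 3)` should produce the columns word, v1, v2, v3.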
    

    An alternative solution would be to create a UDF:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, DoubleType
    
    def to_array(column):
        def to_array_(v):
            return v.toArray().tolist()
        # Important: asNondeterministic requires Spark 2.3 or later.
        # It can be safely removed, i.e.
        # return udf(to_array_, ArrayType(DoubleType()))(column)
        # but at the cost of decreased performance.
        return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(column)
    
    (df
        .withColumn("xs", to_array(col("vector")))
        .select(["word"] + [col("xs")[i] for i in range(3)]))
    
    ## +-------+-----+-----+-----+
    ## |   word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+
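
When only a few dimensions are needed, a variation on the UDF approach is to read single entries inside the UDF instead of converting the whole vector to an array. This is a sketch under assumptions: `vector_element` and `select_dimensions` are hypothetical names, and it relies on `pyspark.ml.linalg` vectors (dense and sparse) supporting integer indexing.

```python
def vector_element(v, i):
    """Return the i-th entry of a vector as a plain Python float."""
    # The cast matters: indexing a vector gives a NumPy scalar, while a UDF
    # declared with DoubleType is expected to return a Python float.
    return float(v[i])


def select_dimensions(df, indices, vector_col="vector", prefix="v"):
    """Keep the non-vector columns and add one column per requested index."""
    # Spark imports are local so vector_element can be used without Spark.
    from pyspark.sql.functions import col, lit, udf
    from pyspark.sql.types import DoubleType

    ith = udf(vector_element, DoubleType())
    others = [col(c) for c in df.columns if c != vector_col]
    picked = [ith(col(vector_col), lit(i)).alias(f"{prefix}{i + 1}")
              for i in indices]
    return df.select(others + picked)
```

For example, `select_dimensions(df, [0, 2])` should return the columns word, v1, v3. Each requested index adds one UDF call per row, so for many dimensions the array-based version above is likely the better choice.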
    

    For a Scala equivalent, see "Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]".