python, pyspark, vector, azure-databricks

Extract "values" from VectorUDT sparse vector in pyspark


I have a PySpark DataFrame with 2 vector columns. When I show the DataFrame in my notebook, each vector prints like this: {"vectorType": "sparse", "length": 262144, "indices": [21641], "values": [1]}

When I print the schema, it shows up as VectorUDT.

I just need the "values" field as a list or array. How can I save that as a new field? Doing "vector_field".values doesn't seem to work because PySpark thinks it's a string...


Solution

  • Spark has a built-in ML function for vector-to-array conversion, vector_to_array. You can simply pass the vector column to get it back as a plain 1-D array; it also takes an optional dtype argument ("float64" by default, or "float32").

    Here's an example:

    from pyspark.ml.linalg import SparseVector, DenseVector
    import pyspark.ml.functions as mfunc
    
    # sample data: a mix of sparse and dense vectors in one column
    data_ls = [
        (SparseVector(3, [(0, 1.0), (2, 2.0)]),), 
        (DenseVector([3.0, 0.0, 1.0]),), 
        (SparseVector(3, [(1, 4.0)]),)
    ]
    
    # `spark` is the SparkSession a Databricks notebook provides out of the box
    df = spark.createDataFrame(data_ls, ['vec']). \
        withColumn('arr', mfunc.vector_to_array('vec'))
    
    df.show(truncate=False)
    
    # +-------------------+---------------+
    # |vec                |arr            |
    # +-------------------+---------------+
    # |(3,[0,2],[1.0,2.0])|[1.0, 0.0, 2.0]|
    # |[3.0,0.0,1.0]      |[3.0, 0.0, 1.0]|
    # |(3,[1],[4.0])      |[0.0, 4.0, 0.0]|
    # +-------------------+---------------+
    
    df.printSchema()
    # root
    #  |-- vec: vector (nullable = true)
    #  |-- arr: array (nullable = false)
    #  |    |-- element: double (containsNull = false)
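
  • If you want only the non-zero values stored in a sparse vector (the "values" field from the printed JSON) rather than the full dense array, there is no dedicated built-in column function, but a small UDF over the vector's values attribute works, since both SparseVector and DenseVector expose one. Below is a minimal sketch reusing df from the example above; vec_values is a hypothetical helper name:

    from pyspark.sql import functions as F, types as T
    
    # hypothetical helper: returns the vector's stored values as a Python list
    @F.udf(returnType=T.ArrayType(T.DoubleType()))
    def vec_values(v):
        # SparseVector.values holds only the non-zero entries;
        # DenseVector.values holds every entry
        return None if v is None else v.values.tolist()
    
    df.withColumn('values', vec_values('vec')).show(truncate=False)
    
    # expected output:
    # +-------------------+---------------+---------------+
    # |vec                |arr            |values         |
    # +-------------------+---------------+---------------+
    # |(3,[0,2],[1.0,2.0])|[1.0, 0.0, 2.0]|[1.0, 2.0]     |
    # |[3.0,0.0,1.0]      |[3.0, 0.0, 1.0]|[3.0, 0.0, 1.0]|
    # |(3,[1],[4.0])      |[0.0, 4.0, 0.0]|[4.0]          |
    # +-------------------+---------------+---------------+

    Prefer vector_to_array when the dense representation is what you need, since it runs natively and avoids the Python UDF serialization overhead; the UDF is only for pulling out the sparse values list itself.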