I have a pyspark dataframe with 2 vector columns. When I show the dataframe in my notebook, it prints each vector like this: {"vectorType": "sparse", "length": 262144, "indices": [21641], "values": [1]}
When I print the schema, it shows up as VectorUDT.
I just need the "values" field value as a list or array. How can I save that as a new field? Doing "vector_field".values doesn't seem to work because pyspark thinks it's a String...
Spark has a built-in ML function for exactly this vector-to-array conversion: vector_to_array (in pyspark.ml.functions, available since Spark 3.0). You simply pass it the vector column and get the vector back as a 1D array of doubles; for a sparse vector, that's the full dense representation, zeros included. Here's an example:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import SparseVector, DenseVector
import pyspark.ml.functions as mfunc

spark = SparkSession.builder.getOrCreate()  # a notebook's existing session also works

# one sparse, one dense, one sparse vector, all of length 3
data_ls = [
    (SparseVector(3, [(0, 1.0), (2, 2.0)]),),
    (DenseVector([3.0, 0.0, 1.0]),),
    (SparseVector(3, [(1, 4.0)]),),
]

df = spark.createDataFrame(data_ls, ['vec']) \
    .withColumn('arr', mfunc.vector_to_array('vec'))

df.show(truncate=False)
df.printSchema()
# +-------------------+---------------+
# |vec |arr |
# +-------------------+---------------+
# |(3,[0,2],[1.0,2.0])|[1.0, 0.0, 2.0]|
# |[3.0,0.0,1.0] |[3.0, 0.0, 1.0]|
# |(3,[1],[4.0]) |[0.0, 4.0, 0.0]|
# +-------------------+---------------+
# root
# |-- vec: vector (nullable = true)
# |-- arr: array (nullable = false)
# | |-- element: double (containsNull = false)
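Side note: with a vector as long as yours (length 262144), vector_to_array materializes every zero, so the resulting array has 262144 elements. If what you actually want is just the stored "values" field of the SparseVector, one option is a small UDF. This is only a sketch, relying on the fact that both SparseVector and DenseVector expose their data as a .values NumPy array, and it reuses the df from the example above:

from pyspark.sql.types import ArrayType, DoubleType
import pyspark.sql.functions as sfunc

# pull the underlying .values array out of each ML vector;
# for a SparseVector that's only the stored (non-zero) entries
values_udf = sfunc.udf(lambda v: v.values.tolist(), ArrayType(DoubleType()))

df.withColumn('values_only', values_udf('vec')).show(truncate=False)
# values_only per row: [1.0, 2.0], [3.0, 0.0, 1.0], [4.0]

Note that a DenseVector keeps all of its entries in .values, so you only get the compact list for rows that are actually sparse.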