arrays apache-spark pyspark apache-spark-sql type-conversion

Convert string list to array type

I have a DataFrame with a column of string type whose values are actually the string representation of an array.

from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
# geography holds the string form of a list, not an actual array
item = spark.createDataFrame([Row(item='fish',geography='[london, a, b, hyd]'),
                              Row(item='chicken',geography='[a, hyd, c]'),
                              Row(item='rice',geography='[a, b, c, blr]'),
                              Row(item='soup',geography='[a, kol, simla]'),
                              Row(item='pav',geography='[a, del]'),
                              Row(item='kachori',geography='[a, guj]'),
                              Row(item='fries',geography='[a, chen]'),
                              Row(item='noodles',geography='[a, mum]')])
item.show()
# +-------+-------------------+
# |   item|          geography|
# +-------+-------------------+
# |   fish|[london, a, b, hyd]|
# |chicken|        [a, hyd, c]|
# |   rice|     [a, b, c, blr]|
# |   soup|    [a, kol, simla]|
# |    pav|           [a, del]|
# |kachori|           [a, guj]|
# |  fries|          [a, chen]|
# |noodles|           [a, mum]|
# +-------+-------------------+

item.printSchema()
#  root
#  |-- item: string (nullable = true)
#  |-- geography: string (nullable = true)

How can I convert the geography column in the above dataset to array type?

Solution

Use split from pyspark.sql.functions to turn the string into an array of strings.

Option 1: strip the brackets and spaces, then split

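from pyspark.sql.functions import regexp_replace, split
# keep only word characters and commas, then split on the commas
new = item.withColumn('geography', split(regexp_replace('geography', r'[^\w,]', ''), ','))
new.printSchema()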
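
To see what the regular expression in option 1 does to a single value, the same two steps can be mimicked in plain Python with the re module. This is only a sketch: the helper name to_list is made up for illustration, and it assumes the values look like '[london, a, b, hyd]'.

```python
import re

def to_list(raw: str) -> list[str]:
    """Hypothetical helper mirroring regexp_replace + split on one value."""
    # Remove every character that is not a word character or a comma
    # (this drops the brackets and the spaces after each comma).
    cleaned = re.sub(r'[^\w,]', '', raw)
    # Split the remaining comma-separated text into a list.
    return cleaned.split(',')

print(to_list('[london, a, b, hyd]'))
```

Because the pattern removes every non-word character, values that contain spaces or hyphens (for example 'new york') would be collapsed to 'newyork', so this approach fits best when the elements are single words.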

Option 2: cast to string, then split

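from pyspark.sql.functions import col, regexp_replace, split
# cast to string (a no-op if it already is one), strip the outer brackets, split on comma plus optional space
new1 = (item.withColumn('geography', regexp_replace(col('geography').cast('string'), r'^\[|\]$', ''))
    .withColumn('geography', split('geography', r',\s*')))
new1.printSchema()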
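
One thing to watch with the cast-then-split idea: if the string form is '[london, a, b, hyd]', splitting on the comma alone keeps the outer brackets and the leading spaces in the elements. The plain-Python sketch below shows that effect and one possible way around it, stripping only the outer brackets and splitting on a comma followed by optional whitespace. The helper name parse_bracketed is hypothetical.

```python
import re

def parse_bracketed(raw: str) -> list[str]:
    """Hypothetical helper: '[a, b c]' -> ['a', 'b c']."""
    # Remove only the leading '[' and the trailing ']'.
    inner = re.sub(r'^\[|\]$', '', raw.strip())
    # Split on a comma plus any following whitespace; spaces inside
    # an element are preserved. An empty '[]' gives an empty list here.
    return re.split(r',\s*', inner) if inner else []

sample = '[london, a, b, hyd]'
print(sample.split(','))        # naive split: brackets and spaces remain
print(parse_bracketed(sample))  # cleaned elements
```

Compared with removing every non-word character, this variant leaves multi-word elements such as 'new york' intact.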