I am trying to run an ETL job on AWS using Glue and PySpark, but unfortunately I'm really new to this.

For the most part I don't have any problem working with the Glue DynamicFrame to perform ApplyMapping and some of the other transformations I need. But I am facing a problem with a particular column, value, that I must convert from string to an integer array. The column's datatype is string, but each entry is in fact an array of integers converted to a string and separated by spaces; for example, a data entry in the value column looks like '111 222 333 444 555 666'. I must convert this column to an integer array so that my data is transformed into [111, 222, 333, 444, 555, 666].

How can I achieve this in AWS Glue using PySpark? Any help is really appreciated.
Split the value column on whitespace using the split function and cast the result to array<int>. Alternatively (from Spark 2.4), use the transform higher-order function and cast each array element to int. Example:
from pyspark.sql.functions import split, col, expr

df = spark.createDataFrame([('111 222 333 444 555 666',)], ["value"])
df.printSchema()
#root
# |-- value: string (nullable = true)

#option 1: split on whitespace and cast to array<int>
df.withColumn("array_int", split(col("value"), "\\s+").cast("array<int>")).\
show(10, False)

#option 2: transform (Spark 2.4+), casting each element to int
#the regex backslash has to be escaped again inside the SQL string literal
df.withColumn("array_int", expr(r"transform(split(value, '\\s+'), x -> int(x))")).\
show(10, False)
#+-----------------------+------------------------------+
#|value |array_int |
#+-----------------------+------------------------------+
#|111 222 333 444 555 666|[111, 222, 333, 444, 555, 666]|
#+-----------------------+------------------------------+
df.withColumn("array_int",split(col("value"),"\\s+").cast("array<int>")).printSchema()
df.withColumn("array_int",expr("""transform(split(value,"\\\s+"), x -> int(x))""")).printSchema()
#root
# |-- value: string (nullable = true)
# |-- array_int: array (nullable = true)
# | |-- element: integer (containsNull = true)
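Since you are starting from a Glue DynamicFrame, you can convert it to a Spark DataFrame with toDF(), apply either expression above, and wrap the result back with DynamicFrame.fromDF() so the rest of your Glue job can keep using it. A minimal sketch, assuming dyf is your existing DynamicFrame and glueContext your GlueContext (both names are placeholders for your own objects):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import split, col

#dyf and glueContext are assumed to already exist in your Glue job
spark_df = dyf.toDF()
#replace the string column in place with the parsed integer array
spark_df = spark_df.withColumn("value", split(col("value"), "\\s+").cast("array<int>"))
#wrap back into a DynamicFrame for the remaining Glue transformations
dyf_out = DynamicFrame.fromDF(spark_df, glueContext, "dyf_out")

Note that with the cast approach, any token that isn't a valid integer is cast to null in the resulting array.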