Tags: amazon-web-services, pyspark, etl, aws-glue

AWS Glue - pySpark: splitting a string column into a new integer array column


I am trying to perform an ETL job on AWS using Glue and pySpark, but unfortunately, I'm really new to this.

For the most part I don't have any problem working with Glue dynamic frames to perform ApplyMapping and some of the other transformations I need. But I am facing a problem with one particular column that I must convert from a string to an integer array. This column, value, has the string data type, but it actually holds an array of integers that was converted to a string, with the numbers separated by spaces. For example, an entry in the value column looks like '111 222 333 444 555 666'. I must convert this column to an integer array so that the entry becomes [111, 222, 333, 444, 555, 666].

How can I achieve this in AWS Glue and using pySpark? Any help is really appreciated.


Solution

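  • Split the value column on whitespace using the split function and cast the result to array<int>. In a Glue job, call toDF() on the DynamicFrame first to get a Spark DataFrame, and convert back with DynamicFrame.fromDF() afterwards if needed.

    • Or, from Spark 2.4 onward, use the transform higher-order function and cast each array element to int.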

    Example:

    from pyspark.sql.functions import col, expr, split

    #assumes an existing SparkSession named spark (in Glue: glueContext.spark_session)
    df=spark.createDataFrame([('111 222 333 444 555 666',)],["value"])
    df.printSchema()
    #root
    # |-- value: string (nullable = true)
    
    #using split and cast as array<int>
    df.withColumn("array_int",split(col("value"),"\\s+").cast("array<int>")).\
        show(10,False)
    
    #using transform function (raw string so Python leaves the backslashes alone)
    df.withColumn("array_int",expr(r"""transform(split(value,"\\s+"), x -> int(x))""")).\
        show(10,False)
    #+-----------------------+------------------------------+
    #|value                  |array_int                     |
    #+-----------------------+------------------------------+
    #|111 222 333 444 555 666|[111, 222, 333, 444, 555, 666]|
    #+-----------------------+------------------------------+
    
    #both approaches produce the same schema
    df.withColumn("array_int",split(col("value"),"\\s+").cast("array<int>")).printSchema()
    df.withColumn("array_int",expr(r"""transform(split(value,"\\s+"), x -> int(x))""")).printSchema()
    #root
    # |-- value: string (nullable = true)
    # |-- array_int: array (nullable = true)
    # |    |-- element: integer (containsNull = true)
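
To make the per-row effect of the two approaches above concrete, here is a minimal plain-Python sketch of what happens to a single value: split on whitespace, then convert each token to an integer. The function name to_int_array is hypothetical and not part of Spark or Glue. It is only a simplified model: a token that cannot be parsed becomes None, similar to the null elements allowed by containsNull = true in the schema above, but Python's int() and Spark's cast do not agree on every edge case.

```python
from typing import List, Optional


def to_int_array(value: Optional[str]) -> Optional[List[Optional[int]]]:
    """Split a whitespace-separated string and convert each token to int.

    Hypothetical helper for illustration only. Unparseable tokens become
    None, as a rough stand-in for null elements in array<int>.
    """
    if value is None:
        # A null input stays null, like a null cell in the column.
        return None
    result: List[Optional[int]] = []
    for token in value.split():  # split() with no argument splits on runs of whitespace
        try:
            result.append(int(token))
        except ValueError:
            result.append(None)
    return result


print(to_int_array("111 222 333 444 555 666"))  # [111, 222, 333, 444, 555, 666]
print(to_int_array("111 abc 333"))              # [111, None, 333]
```

If you need custom handling of malformed tokens, a function like this could be wrapped with pyspark.sql.functions.udf and an ArrayType(IntegerType()) return type. The built-in split and cast shown above are usually the better choice, since they avoid Python UDF overhead.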