
Append column to an array in a PySpark dataframe


I have a DataFrame containing two columns:

| VPN    | UPC             |
+--------+-----------------+
| 1      | [4,2]           |
| 2      | [1,2]           |
| null   | [4,7]           |

I need a result column with the value of the VPN column (a string) appended to the UPC column (an array). The result should look something like this:

| result  |
+---------+
| [4,2,1] |
| [1,2,2] |
| [4,7,]  |

Solution

  • One option is to use concat + array. First use array to wrap the VPN column in a single-element array, then concatenate the two array columns with the concat function:

    df = spark.createDataFrame([(1, [4, 2]), (2, [1, 2]), (None, [4, 7])], ['VPN', 'UPC'])
    
    df.show()
    +----+------+
    | VPN|   UPC|
    +----+------+
    |   1|[4, 2]|
    |   2|[1, 2]|
    |null|[4, 7]|
    +----+------+
    
    df.selectExpr('concat(UPC, array(VPN)) as result').show()
    +---------+
    |   result|
    +---------+
    |[4, 2, 1]|
    |[1, 2, 2]|
    |  [4, 7,]|
    +---------+
    

    Or, more Pythonic, using the DataFrame API functions directly:

    from pyspark.sql.functions import array, concat
    
    df.select(concat('UPC', array('VPN')).alias('result')).show()
    +---------+
    |   result|
    +---------+
    |[4, 2, 1]|
    |[1, 2, 2]|
    |  [4, 7,]|
    +---------+
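
    If you want to keep the original columns and append the result alongside them, the same concat + array expression works with withColumn. A minimal sketch, assuming Spark 2.4+ (where concat accepts array columns); note that a null VPN becomes a trailing null element rather than being dropped, which matches the [4, 7,] row above:

    from pyspark.sql.functions import array, concat
    
    # Wrap the scalar VPN value in a one-element array, then
    # concatenate it onto the end of the existing UPC array,
    # keeping the original VPN and UPC columns intact.
    df.withColumn('result', concat('UPC', array('VPN'))).show()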