PySpark equivalent of adding a constant array to a dataframe as column


The following code works in Spark with Scala.

scala> val ar = Array("oracle", "java")
ar: Array[String] = Array(oracle, java)

scala> df.withColumn("tags", lit(ar)).show(false)
+------+---+----------+----------+--------------+
|name  |age|role      |experience|tags          |
+------+---+----------+----------+--------------+
|John  |25 |Developer |2.56      |[oracle, java]|
|Scott |30 |Tester    |5.2       |[oracle, java]|
|Jim   |28 |DBA       |3.0       |[oracle, java]|
|Mike  |35 |Consultant|10.0      |[oracle, java]|
|Daniel|26 |Developer |3.2       |[oracle, java]|
|Paul  |29 |Tester    |3.6       |[oracle, java]|
|Peter |30 |Developer |6.5       |[oracle, java]|
+------+---+----------+----------+--------------+

How do I get the same behavior in PySpark? I tried the code below, but it fails with a Java exception.

from pyspark.sql.functions import lit

tag = ["oracle", "java"]
df2.withColumn("tags", lit(tag)).show()

: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [oracle, java]


Solution

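  • Import array from the pyspark.sql.functions module and build the column from one lit() per element. As the error in the question shows, lit() rejects a Python list there, but array() accepts the individual literal columns: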

    >>> from pyspark.sql.functions import array, lit
    
    >>> tag = array(lit("oracle"), lit("java"))
    >>> df2.withColumn("tags", tag).show()
    

    Tested on another dataframe (ranked):

    >>> from pyspark.sql.functions import array, lit
    
    >>> tag = array(lit("oracle"), lit("java"))
    >>> ranked.withColumn("tag", tag).show()
    +------+--------------+----------+-----+----+----+--------------+
    |gender|    ethinicity|first_name|count|rank|year|           tag|
    +------+--------------+----------+-----+----+----+--------------+
    |  MALE|      HISPANIC|    JAYDEN|  364|   1|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
    |  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
    |  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
    |  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
    |  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
    |  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
    |  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
    |  MALE|      HISPANIC|     ANGEL|  236|  18|2012|[oracle, java]|
    |  MALE|      HISPANIC|     AIDEN|  235|  19|2012|[oracle, java]|
    |  MALE|WHITE NON HISP|    DANIEL|  232|  20|2012|[oracle, java]|
    +------+--------------+----------+-----+----+----+--------------+
    only showing top 20 rows