Tags: python, apache-spark, pyspark

Is it possible to take an arbitrary number of elements from an array in PySpark?


My dataframe has two array columns. I want to take the elements of the first column whose indices are listed in the second column. For example, given the following dataset:

df = spark.createDataFrame(
    [
        {
            'text': ['0', '1', '2', '3', '4', '5'],
            'indices': [0, 2, 4],
        },
    ]
)

So I want to get a column with the value `['0', '2', '4']`.

Is it possible to achieve this without writing a UDF?


Solution

  • You can use the expr function with TRANSFORM and element_at to select elements from the first array based on the indices in the second array. Note that element_at uses 1-based indexing, while your indices column is 0-based, so each index has to be shifted by one.

    E.g.:

    from pyspark.sql.functions import expr
    
    # element_at is 1-based, so shift each 0-based index with i + 1
    df = df.withColumn(
        "selected_text",
        expr("TRANSFORM(indices, i -> element_at(text, i + 1))")
    )
    df.show()
    # selected_text now contains ['0', '2', '4']
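
    If you would rather avoid the SQL expression string, here is a minimal sketch of the same idea using the native DataFrame API with pyspark.sql.functions.transform (available since Spark 3.1), assuming the same df as above:

    from pyspark.sql.functions import col, element_at, transform
    
    # The same 1-based shift applies here
    df = df.withColumn(
        "selected_text",
        transform(col("indices"), lambda i: element_at(col("text"), i + 1))
    )
    df.show()

    Both variants rely only on built-in functions, so no UDF is needed.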