My dataframe has two array columns. I want to take the elements from the first column at the indices given in the second column. For example, I have the following dataset:
df = spark.createDataFrame(
    [
        {
            'text': ['0', '1', '2', '3', '4', '5'],
            'indices': [0, 2, 4],
        },
    ]
)
So I want a column with the value `['0', '2', '4']`.
Is it possible to achieve this without writing a UDF?
You can use the `expr` function with `TRANSFORM` and `element_at` to select elements from the first array based on the indices in the second array. Note that `element_at` uses 1-based indexing while your `indices` are 0-based, so add 1 to each index inside the lambda. E.g.:
from pyspark.sql.functions import expr

df = df.withColumn(
    "selected_text",
    # element_at is 1-based, while the indices column is 0-based, hence i + 1
    expr("TRANSFORM(indices, i -> element_at(text, i + 1))")
)
df.show()
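For the sample row, `selected_text` comes out as `['0', '2', '4']`. If you prefer a single expression instead of `withColumn`, the same SQL works through `selectExpr`; here is a minimal sketch of that equivalent alternative (the `result` name is just for illustration):

result = df.selectExpr(
    "text",
    "indices",
    # same logic: map each 0-based index to the matching (1-based) array element
    "transform(indices, i -> element_at(text, i + 1)) AS selected_text"
)
result.show(truncate=False)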