I've got an array column in a pyspark dataframe, and I want to find the index of the first positive number in each array. The data looks like this:
| id | arr |
| --- | --- |
| Cell 1 | [-1, -1, -1, -1] |
| Cell 2 | [-1, -1, 5, -1] |
| Cell 3 | [-1, 3, -1, -1] |
I want to get an output similar to this:
| id | arr | first_positive_element_index |
| --- | --- | --- |
| Cell 1 | [-1, -1, -1, -1] | null |
| Cell 2 | [-1, -1, 5, -1] | 2 |
| Cell 3 | [-1, 3, -1, -1] | 1 |
I can do this with a UDF, but the data is quite large, which makes that approach extremely slow. I would prefer a better way that avoids a UDF entirely.
Note: all non-positive numbers are -1
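For reference, the sample data above can be reproduced like this (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [('Cell 1', [-1, -1, -1, -1]),
     ('Cell 2', [-1, -1, 5, -1]),
     ('Cell 3', [-1, 3, -1, -1])],
    ['id', 'arr'],
)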
You can explode the array, keep only the positive values, and use array_position (via expr) to recover each value's index:
import pyspark.sql.functions as func

df_pos = df.select(
    'id', 'arr',
    func.explode('arr').alias('arr_explode_value')
).filter(
    # keep only the positive elements (per the note, everything non-positive is -1)
    func.col('arr_explode_value') >= 0
).withColumn(
    # array_position is 1-based, so subtract 1 for a 0-based index
    'pos', func.expr('array_position(arr, arr_explode_value)') - 1
).groupBy(
    'id'
).agg(
    # the smallest index among the positive elements is the first one
    func.min('pos').alias('pos')
)
df_pos.show(10, False)
+------+---+
|id |pos|
+------+---+
|Cell 2|2 |
|Cell 3|1 |
+------+---+
This creates a reference dataframe holding the first positive index for every id that has at least one positive element. The rest is to join the reference table back to the original dataframe; ids with no match (like Cell 1) get null from the left join.
df.select('id', 'arr').join(df_pos.select('id', 'pos'), on=['id'], how='left')
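To match the column name from the desired output, you can also rename pos as part of the join; for example:

result = df.join(
    df_pos.select('id', func.col('pos').alias('first_positive_element_index')),
    on=['id'], how='left'
)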
Edit 1:
If you don't want to use explode because the arrays are long, you can use transform (the Python lambda form needs Spark 3.1+) together with array_position:
df.select(
    'id', 'arr',
    # flag each element: 1 if positive, 0 otherwise
    func.transform(func.col('arr'), lambda value: func.when(value >= 0, 1).otherwise(0)).alias('transformed_arr')
).withColumn(
    # array_position is 1-based and returns 0 when no 1 is found,
    # so after subtracting 1 a row with no positive element gets -1
    'pos', func.array_position('transformed_arr', 1) - 1
).show(
    10, False
)
+------+----------------+---------------+---+
|id |arr |transformed_arr|pos|
+------+----------------+---------------+---+
|Cell 1|[-1, -1, -1, -1]|[0, 0, 0, 0] |-1 |
|Cell 2|[-1, -1, 5, -1] |[0, 0, 1, 0] |2 |
|Cell 3|[-1, 3, -1, -1] |[0, 1, 0, 0] |1 |
+------+----------------+---------------+---+
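If you need the null from the desired output instead of the -1 sentinel, you can null it out afterwards; a minimal sketch, assuming the result above is assigned to df_result rather than shown directly:

df_result = df_result.withColumn(
    'pos',
    func.when(func.col('pos') >= 0, func.col('pos'))  # no otherwise(), so -1 becomes null
)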
Since the arr column is of array type, you can use transform to apply a function to each of its elements.
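If you prefer a single expression, the same transform-plus-array_position idea also fits into one expr call; this is just a sketch of that variant, using nullif to map the not-found sentinel to null:

df.withColumn(
    'first_positive_element_index',
    # transform marks positives as true; array_position finds the first true
    # (1-based, 0 if absent); nullif turns the resulting -1 into null
    func.expr('nullif(array_position(transform(arr, x -> x > 0), true) - 1, -1)')
)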