
PySpark: make a new column from a single element of another column's Row entries?


I trained an XGBoost classifier model in PySpark and transformed some data via

outp = model.transform(inp)

Now outp contains a column 'probability' with row entries such as

Row(probability=DenseVector([0.99,0.01]))

I'd like to add a new column to outp that contains, as a float, the second probability component of each of the Row elements mentioned above (so e.g. just 0.01 instead of Row(...)). What is the correct syntax to do that?

I tried

outp = outp.select("*",(col('probability')[:,1]).alias('prob'))

expecting that the second element (index 1) of the vector in each row of the column would be selected. But that syntax produces an error.


Solution

  • Using the suggestion from the comment by samkart, I changed the syntax to:

    from pyspark.ml.functions import vector_to_array  # available since Spark 3.0
    outp = outp.select("*", vector_to_array('probability').getItem(1).alias('prob'))
    

    and now it does what I wanted.