python python-2.7 pyspark apache-spark-sql spss-modeler

Using df.withColumn() on multiple columns

I am working with Python and PySpark to extend SPSS Modeler.

I want to manipulate ~5000 columns and therefore use the following construct:

for target in targets:
    inputData = inputData.withColumn(target+appendString, function(target))

This is very slow. Is there a more efficient way to do this for all target columns?

targets is a list of the column names to process; function(target) is a placeholder for the operations I apply to the columns, such as adding and dividing.

I would be happy if you could help me :)

pandayo

Solution

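Try building all the new columns in a single select instead. Each withColumn call adds another projection to the query plan, so thousands of calls in a loop produce a very large plan; one select adds a single projection: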

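# Keep all existing columns ('*') and add every derived column in one projection.
# function(target) must return a Column expression, as it does in the withColumn loop.
inputData = inputData.select(
    '*',
    *(function(target).alias(target+appendString) for target in targets)
)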
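
If this pattern is needed in several places, it could be wrapped in a small helper. The following is only a sketch: the names with_derived_columns, suffix, and make_expr are made up for illustration, and make_expr stands in for the question's function placeholder, which is assumed to take a column name and return a PySpark Column.

```python
def with_derived_columns(df, targets, suffix, make_expr):
    """Return df plus one derived column per target, built in a single select.

    df        -- a pyspark.sql.DataFrame
    targets   -- list of existing column names
    suffix    -- string appended to each target name (appendString in the question)
    make_expr -- hypothetical callable: column name -> pyspark Column expression
    """
    # Build every expression first; Spark does not evaluate anything here.
    derived = [make_expr(name).alias(name + suffix) for name in targets]
    # A single projection: all original columns ('*') followed by the new ones.
    return df.select("*", *derived)
```

With the names from the question, the loop would then become inputData = with_derived_columns(inputData, targets, appendString, function). The list is built eagerly so that the number of derived columns is easy to inspect before the select is issued.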