Tags: python, apache-spark, pyspark, apache-spark-ml, fpgrowth

Is there a way to put multiple columns in the pyspark array function? (FPGrowth prep)


I have a DataFrame with symptoms of a disease, and I want to run FPGrowth on the entire DataFrame. FPGrowth wants an array column as input, and it works with this code:

import pyspark.sql.functions as F

dfFPG = df.select(F.array(df["Gender"],
                          df["Polyuria"],
                          df["Polydipsia"],
                          df["Sudden weight loss"],
                          df["Weakness"],
                          df["Polyphagia"],
                          df["Genital rush"],
                          df["Visual blurring"],
                          df["Itching"]).alias("features"))

from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
model = fpGrowth.fit(dfFPG)

model.freqItemsets.show(20, truncate=False)

The features list is actually longer, and whenever I rename df I have to use find and replace. I know I can use F.col("Gender") instead of df["Gender"], but is there a way to put all the columns inside F.array() at once and exclude a few of them, like df["Age"]? Or is there another efficient way to prepare categorical features for FPGrowth that I'm not aware of?


Solution

  • You can get all the column names using df.columns and put them all into the array:

    import pyspark.sql.functions as F

    # Take every column name from df.columns, skipping the ones to exclude
    dfFPG = df.select(
        F.array(*[c for c in df.columns if c not in ['col1', 'col2']]).alias("features")
    )
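
  • One caveat worth flagging: Spark's FPGrowth requires the items within each transaction to be unique, so if several Yes/No columns share the same raw values, the array above will contain duplicates and fit() will fail. A common workaround is to tag each value with its column name. Below is a minimal sketch of the full pipeline using that encoding; the excluded set (using "Age" from the question) and the "column=value" tagging are illustrative assumptions, not part of the original answer:

    import pyspark.sql.functions as F
    from pyspark.ml.fpm import FPGrowth

    # Columns to leave out of the item basket (illustrative; adjust to your schema)
    excluded = {"Age"}
    cols = [c for c in df.columns if c not in excluded]

    # Encode each item as "column=value" (e.g. "Polyuria=Yes") so that items
    # stay unique within a transaction, which FPGrowth requires
    dfFPG = df.select(
        F.array(*[F.concat_ws("=", F.lit(c), F.col(c)) for c in cols]).alias("features")
    )

    fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
    model = fpGrowth.fit(dfFPG)
    model.freqItemsets.show(20, truncate=False)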