Tags: python, apache-spark, pyspark, apache-spark-ml, fpgrowth

Is there a way to put multiple columns in the pyspark array function? (FPGrowth prep)


I have a DataFrame with symptoms of a disease, and I want to run FPGrowth on the entire DataFrame. FPGrowth wants an array column as input, and it works with this code:

import pyspark.sql.functions as F

dfFPG = df.select(F.array(df["Gender"],
                          df["Polyuria"],
                          df["Polydipsia"],
                          df["Sudden weight loss"],
                          df["Weakness"],
                          df["Polyphagia"],
                          df["Genital rush"],
                          df["Visual blurring"],
                          df["Itching"]).alias("features"))

from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
model = fpGrowth.fit(dfFPG)

model.freqItemsets.show(20, truncate=False)

The features list is actually longer, and whenever I rename df I have to use find and replace. I know I can use F.col("Gender") instead of df["Gender"], but is there a way to put all the columns inside F.array() at once and exclude a few of them, like df["Age"]? Or is there another efficient way to prepare categorical features for FPGrowth that I'm not aware of?


Solution

  • You can get all the column names using df.columns and put them all into the array:

    import pyspark.sql.functions as F

    # Take every column name from df.columns, skipping the ones to exclude
    dfFPG = df.select(
        F.array(*[c for c in df.columns if c not in ['col1', 'col2']]).alias("features")
    )
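
  • One caveat worth flagging: Spark's FPGrowth requires the items within each transaction to be unique, so if several Yes/No columns share the same raw values, the array above will contain duplicates and fit() will fail. A common workaround is to tag each value with its column name. Below is a minimal sketch of the full pipeline using that encoding; the excluded set (using "Age" from the question) and the "column=value" tagging are illustrative assumptions, not part of the original answer:

    import pyspark.sql.functions as F
    from pyspark.ml.fpm import FPGrowth

    # Columns to leave out of the item basket (illustrative; adjust to your schema)
    excluded = {"Age"}
    cols = [c for c in df.columns if c not in excluded]

    # Encode each item as "column=value" (e.g. "Polyuria=Yes") so that items
    # stay unique within a transaction, which FPGrowth requires
    dfFPG = df.select(
        F.array(*[F.concat_ws("=", F.lit(c), F.col(c)) for c in cols]).alias("features")
    )

    fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
    model = fpGrowth.fit(dfFPG)
    model.freqItemsets.show(20, truncate=False)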