I have a DataFrame with symptoms of a disease, and I want to run FP-Growth on the entire DataFrame. FP-Growth expects an array column as input, and it works with this code:
from pyspark.ml.fpm import FPGrowth
import pyspark.sql.functions as F

dfFPG = df.select(F.array(df["Gender"],
                          df["Polyuria"],
                          df["Polydipsia"],
                          df["Sudden weight loss"],
                          df["Weakness"],
                          df["Polyphagia"],
                          df["Genital rush"],
                          df["Visual blurring"],
                          df["Itching"]).alias("features"))

fpGrowth = FPGrowth(itemsCol="features", minSupport=0.3, minConfidence=0.2)
model = fpGrowth.fit(dfFPG)
model.freqItemsets.show(20, truncate=False)
The actual feature list is longer, and whenever I rename df I have to use find and replace. I know I can use F.col("Gender") instead of df["Gender"], but is there a way to put all the columns inside F.array() at once and exclude a few of them, like df["Age"]?
Or, is there any other efficient way to prepare categorical features for FP-Growth that I'm not aware of?
You can get all the column names using df.columns and put them all into the array, excluding the ones you don't want:
import pyspark.sql.functions as F
dfFPG = df.select(F.array(*[c for c in df.columns if c not in ['col1', 'col2']]).alias("features"))