Feature Selection in PySpark

I am working on a machine learning model with a data set of shape 1,456,354 x 53, and I want to do feature selection on it. I know how to do feature selection in Python with scikit-learn, using the following code:

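import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# df is assumed to hold the feature columns and arrythmia the target
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)
features_bool = np.array(rfe.support_)  # boolean mask of the kept features
features = np.array(df.columns)
result = features[features_bool]  # names of the selected features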

However, I could not find any article that shows how to perform recursive feature selection in PySpark.

I tried to import the sklearn libraries in PySpark, but it failed with a "sklearn module not found" error. I am running PySpark on a Google Dataproc cluster.

Could someone please help me achieve this in PySpark?


  • You can try the following feature selection methods in PySpark:

    • Chi-squared selector
    • Random forest selector (rank features by importance)
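
The two methods listed above could be sketched roughly as follows. This is a minimal illustration, not a drop-in replacement for RFE: the function names, the "features" and "selected" column names, and the numTrees and seed values are all assumptions made for the example. Note that ChiSqSelector treats every feature as categorical, so continuous columns would probably need binning first, and that newer Spark releases also offer UnivariateFeatureSelector.

```python
def top_k_by_score(names, scores, k):
    """Return the k names with the highest scores, best first."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def chi_squared_features(df, feature_cols, label_col, k):
    """Names of the k features kept by a chi-squared test against the label."""
    # Imported here so the helper above stays usable without a Spark install.
    from pyspark.ml.feature import ChiSqSelector, VectorAssembler

    # Spark ML estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    assembled = assembler.transform(df)

    selector = ChiSqSelector(
        numTopFeatures=k,
        featuresCol="features",
        labelCol=label_col,
        outputCol="selected",
    )
    model = selector.fit(assembled)
    # selectedFeatures holds indices into the assembled vector.
    return [feature_cols[i] for i in model.selectedFeatures]


def random_forest_features(df, feature_cols, label_col, k, seed=42):
    """Names of the k features with the highest random forest importance."""
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    assembled = assembler.transform(df)

    # The label column must be numeric class indices (0, 1, ...).
    forest = RandomForestClassifier(
        featuresCol="features", labelCol=label_col, numTrees=50, seed=seed
    )
    model = forest.fit(assembled)
    importances = model.featureImportances.toArray()
    return top_k_by_score(feature_cols, importances, k)
```

Assuming df is a Spark DataFrame with a numeric label column named "label", a call such as random_forest_features(df, feature_cols, "label", 28) would return 28 column names, matching the n_features_to_select=28 used in the question. Unlike RFE, both functions rank the features in a single pass; to get closer to recursive elimination you could refit in a loop and drop the weakest feature each round.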
