RobustScaler in PySpark

I would like to use a RobustScaler to preprocess my data. In sklearn it is available as sklearn.preprocessing.RobustScaler. However, I am using PySpark, so I tried to import it with:

from pyspark.ml.feature import RobustScaler

However, I receive the following error:

ImportError: cannot import name 'RobustScaler' from 'pyspark.ml.feature'

As pault pointed out, RobustScaler is only available from PySpark 3 onward. In the meantime, I am trying to implement it myself as:

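import numpy as np
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession, functions as sf
from sklearn.preprocessing import RobustScaler

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

class PySpark_RobustScaler(Pipeline):
    def __init__(self):
        pass

    def fit(self, df):
        # Nothing is learned here; the quantiles are computed in transform()
        return self

    def transform(self, df):
        for col_name in df.columns:
            # relativeError=0.0 requests exact quantiles (can be expensive on large data)
            q1, q2, q3 = df.approxQuantile(col_name, [0.25, 0.5, 0.75], 0.0)
            # Center on the median and divide by the IQR; the factor 2.0 is the
            # empirical correction I needed to match sklearn on the sample below
            df = df.withColumn(col_name, 2.0 * (sf.col(col_name) - q2) / (q3 - q1))
        return df

arr = np.array(
    [[ 1., -2.,  2.],
     [-2.,  1.,  3.],
     [ 4.,  1., -2.]]
)

rdd1 = sc.parallelize(arr)
# Convert NumPy scalars to plain Python floats so Spark can infer the schema
rdd2 = rdd1.map(lambda x: [float(i) for i in x])
df_sprk = rdd2.toDF(["A", "B", "C"])
df_pd = pd.DataFrame(arr, columns=list('ABC'))

# Compare the custom scaler with sklearn on the same data
PySpark_RobustScaler().fit(df_sprk).transform(df_sprk).show()
print(RobustScaler().fit(df_pd).transform(df_pd))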

However, I have found that to obtain the same result as sklearn I have to multiply the result by 2. Furthermore, I am worried that if a column has many values close to zero, the interquartile range q3-q1 could become too small and make the result diverge, creating null values.
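One possible explanation for the factor of 2, sketched in plain Python: approxQuantile is documented to return a sample from the DataFrame, whereas sklearn's RobustScaler relies on NumPy percentiles, which interpolate linearly between neighbouring values by default. On a three-row column the two conventions give different interquartile ranges. The helper element_quantile and its ceil(p * n) rank rule are assumptions made for illustration only, not Spark's actual implementation.

```python
import math
import statistics


def element_quantile(values, p):
    """Pick an existing element at rank ceil(p * n).

    Hypothetical stand-in for a quantile that never interpolates.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


column_a = [1.0, -2.0, 4.0]  # column "A" from the question

# Convention 1: quantiles are always actual elements of the column
iqr_element = element_quantile(column_a, 0.75) - element_quantile(column_a, 0.25)

# Convention 2: linear interpolation between neighbouring elements
q1, _, q3 = statistics.quantiles(column_a, n=4, method="inclusive")
iqr_linear = q3 - q1

print(iqr_element, iqr_linear)
```

If this is indeed the cause, the factor is an artifact of the tiny sample: on columns with many rows the two conventions tend to give nearly the same quartiles, so a hard-coded 2.0 should not be expected to hold in general.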

Does anyone have any suggestions on how to improve it?

Solution

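RobustScaler has been available in pyspark.ml.feature since Spark 3.0.0, so on a recent PySpark version the import from the question works as written. Note that it operates on a vector column (for example, one built with VectorAssembler) and that its withCentering parameter defaults to False, so it has to be set to True to subtract the median as sklearn does by default.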
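For anyone still on an older version and using a hand-written scaler, the division-by-zero concern from the question can be handled by falling back to a scale of 1 when the interquartile range is (near) zero, which is, as far as I know, how sklearn treats constant features. Below is a minimal sketch of that guard in plain Python; robust_scale and eps are hypothetical names chosen for illustration, and in PySpark the same check would go just before the withColumn call.

```python
import statistics


def robust_scale(values, eps=1e-12):
    """Center on the median and divide by the IQR, guarding against IQR ~ 0.

    Hypothetical helper; eps is an assumed tolerance, not a library default.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Fall back to a scale of 1.0 so constant columns are centered only
    scale = iqr if abs(iqr) > eps else 1.0
    return [(v - median) / scale for v in values]


print(robust_scale([1.0, -2.0, 4.0]))  # ordinary column
print(robust_scale([5.0, 5.0, 5.0]))   # constant column, IQR is zero
```

With this guard a constant column is mapped to zeros instead of producing nulls or infinities.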