Tags: python, apache-spark, pyspark, random-forest, binning

How to pass a numeric feature with a large number of unique values to the Random Forest regression algorithm in PySpark MLlib?


I have a dataset with a numeric feature column that has a large number of unique values (on the order of 10,000). I know that when we build a Random Forest regression model in PySpark, we pass a parameter maxBins, which should be at least equal to the maximum number of unique values in any feature. So if I pass 10,000 as the maxBins value, the algorithm will not be able to take the load and will either fail or run forever. How can I pass such a feature to the model? I have read in a few places about binning the values into buckets and then passing those buckets to the model, but I have no idea how to do that in PySpark. Can anyone show sample code for it? My current code is this:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.sql.functions import col
    from pyspark.mllib.regression import LabeledPoint

    def parse(line):
        # line[6] and line[8] are feature columns with large unique values; line[12] is the numeric label
        return (line[1], line[3], line[4], line[5], line[6], line[8], line[11], line[12])

    # Tuple-unpacking lambdas are Python 2 only, so index into the (line, rownum) pairs instead
    raw_data = (sc.textFile('file1.csv')
                .zipWithIndex()
                .filter(lambda pair: pair[1] >= 0)
                .map(lambda pair: pair[0]))

    parsed_data = (raw_data
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) > 1)
        .map(parse))


    # Split the input data into training and test sets with a 70%-30% ratio
    (train_data, test_data) = parsed_data.randomSplit([0.7, 0.3])

    label_col = "x7"


# converting RDD to dataframe. x4 and x5 are columns with large unique values
train_data_df = train_data.toDF(("x0","x1","x2","x3","x4","x5","x6","x7"))

# Indexers encode strings with doubles
string_indexers = [
   StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
   for x in train_data_df.columns if x != label_col 
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in train_data_df.columns if x != label_col ],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(train_data_df)
indexed = model.transform(train_data_df)

label_points = (indexed
.select(col(label_col).cast("float").alias("label"), col("features"))
.map(lambda row: LabeledPoint(row.label, row.features)))

If anyone can provide sample code showing how to modify the code above to bin the two high-cardinality numeric feature columns, that would be helpful.


Solution

  • we pass a parameter maxBins, which should be at least equal to the maximum number of unique values in any feature.

    That is not true. maxBins has to be greater than or equal to the maximum number of categories among the categorical features; continuous features are discretized into at most maxBins candidate split thresholds no matter how many unique values they have, so a numeric column with 10,000 distinct values does not force maxBins up to 10,000. You still have to tune this parameter to obtain the desired performance, but otherwise there is nothing else to do here.
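
To illustrate, here is a minimal sketch of passing maxBins to the mllib trainer, assuming the label_points RDD built above; numTrees, maxDepth, and the bin count are hypothetical values to tune, not recommendations:

    from pyspark.mllib.tree import RandomForest

    # Sketch only: categoricalFeaturesInfo maps feature index -> category count.
    # Leaving it empty treats every feature as continuous, so the high-cardinality
    # numeric columns put no lower bound on maxBins.
    rf_model = RandomForest.trainRegressor(
        label_points,
        categoricalFeaturesInfo={},
        numTrees=50,                   # hypothetical; tune
        featureSubsetStrategy="auto",
        impurity="variance",
        maxDepth=10,                   # hypothetical; tune
        maxBins=32                     # the default; it need not match unique-value counts
    )

Note that this only holds if the high-cardinality columns really are numeric; a feature declared in categoricalFeaturesInfo with N categories does require maxBins >= N.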
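
And since the question explicitly asks for it: if you still want to bucket the two high-cardinality numeric columns, one option is QuantileDiscretizer from pyspark.ml.feature (Spark 1.6+). A sketch, reusing the x4/x5 column names from the question, with 10 buckets as a hypothetical choice; note that QuantileDiscretizer expects a numeric (double) input column, so the values would need a cast first:

    from pyspark.ml.feature import QuantileDiscretizer

    # Sketch only: quantile-based bucketing of x4 and x5 into 10 bins each
    # (a hypothetical bin count; tune it for your data).
    discretizers = [
        QuantileDiscretizer(numBuckets=10, inputCol=x, outputCol="bin_{0}".format(x))
        for x in ["x4", "x5"]
    ]
    # These stages would replace the StringIndexer stages for x4/x5 in the Pipeline,
    # and the VectorAssembler would take "bin_x4"/"bin_x5" instead of "idx_x4"/"idx_x5".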