I have a dataset with a numeric feature column that has a very large number of unique values (on the order of 10,000). I know that when we train a Random Forest regression model in PySpark, we pass a parameter maxBins, which should be at least equal to the maximum number of unique values across all features. So if I pass 10,000 as the maxBins value, the algorithm will not be able to take the load and will either fail or run forever. How can I pass such a feature to the model? I have read in a few places about binning the values into buckets and then passing those buckets to the model, but I have no idea how to do that in PySpark. Can anyone show sample code for that? My current code is this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

def parse(line):
    # line[6] and line[8] are the feature columns with many unique values;
    # line[12] is the numeric label
    return (line[1], line[3], line[4], line[5], line[6], line[8], line[11], line[12])
lines = (sc.textFile('file1.csv')
         .zipWithIndex()                      # pair each line with its row number
         .filter(lambda pair: pair[1] >= 0)   # keeps every row (rownum >= 0 is always true)
         .map(lambda pair: pair[0]))
parsed_data = (lines
               .map(lambda line: line.split(","))
               .filter(lambda line: len(line) > 1)
               .map(parse))
# Split the input data into training and test sets with a 70%/30% ratio
(train_data, test_data) = parsed_data.randomSplit([0.7, 0.3])
label_col = "x7"
# Convert the RDD to a DataFrame; x4 and x5 are the high-cardinality columns
train_data_df = train_data.toDF(["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"])
# StringIndexer encodes each string column as numeric category indices
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in train_data_df.columns if x != label_col
]
# VectorAssembler combines multiple columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in train_data_df.columns if x != label_col],
    outputCol="features"
)
pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(train_data_df)
indexed = model.transform(train_data_df)
label_points = (indexed
                .select(col(label_col).cast("float").alias("label"), col("features"))
                .map(lambda row: LabeledPoint(row.label, row.features)))
If anyone can show how to modify the code above to bin the two high-cardinality numeric feature columns, that would be helpful.
"we pass a parameter maxBins, which should be at least equal to the maximum number of unique values across all features."
That is not true. maxBins has to be greater than or equal to the maximum number of categories for categorical features; for continuous features it only controls how many candidate split thresholds are evaluated, so a numeric column with around 10,000 unique values does not require a maxBins anywhere near 10,000. You still have to tune this parameter to obtain the desired performance, but otherwise there is nothing else to do here.
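For example, with the RDD-based MLlib API your code already targets (it builds an RDD of LabeledPoints), training could look like the sketch below. This is only an illustration: numTrees and maxDepth are placeholder values you would tune, and categoricalFeaturesInfo is left empty so that every feature, including the two high-cardinality ones, is treated as continuous.

from pyspark.mllib.tree import RandomForest

# categoricalFeaturesInfo maps feature index -> number of categories.
# Continuous features are simply omitted from it, so maxBins does not
# have to cover their unique values.
model = RandomForest.trainRegressor(
    label_points,
    categoricalFeaturesInfo={},     # treat all features as continuous
    numTrees=50,                    # placeholder; tune for your data
    featureSubsetStrategy="auto",
    impurity="variance",
    maxDepth=10,                    # placeholder; tune for your data
    maxBins=32                      # the default is fine here
)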
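If you nevertheless want to bin the two columns into buckets, as you describe, one option is Bucketizer from pyspark.ml.feature. A minimal sketch, assuming x4 and x5 are the high-cardinality columns and that the split boundaries below are placeholders you would replace with values suited to your data's range (for example precomputed quantiles):

from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col

# Bucketizer expects numeric input, and fields parsed from a CSV line
# are strings, so cast the two columns to doubles first.
df = (train_data_df
      .withColumn("x4", col("x4").cast("double"))
      .withColumn("x5", col("x5").cast("double")))

# Placeholder boundaries; the infinities catch out-of-range values.
splits = [-float("inf"), 100.0, 1000.0, 5000.0, float("inf")]

bucketizer_x4 = Bucketizer(splits=splits, inputCol="x4", outputCol="bucket_x4")
bucketizer_x5 = Bucketizer(splits=splits, inputCol="x5", outputCol="bucket_x5")

bucketed = bucketizer_x5.transform(bucketizer_x4.transform(df))

Both bucketizers are Transformers, so they could also go into your Pipeline in place of the StringIndexers for those two columns, with the VectorAssembler pointed at bucket_x4 and bucket_x5 instead of idx_x4 and idx_x5. But again, none of this is required just to keep maxBins small.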