I'm trying to run a HyperparameterTuner on an Estimator for an LDA model in a SageMaker notebook (MXNet container), but I'm running into errors related to the feature_dim hyperparameter. I believe the cause is the differing dimensions of the train and test datasets, but I'm not certain that's the case or how to fix it.
[note that I'm setting the feature_dim to the training dataset's dimensions]
import sagemaker
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

vocabulary_size = doc_term_matrix_train.shape[1]

lda = sagemaker.estimator.Estimator(
    container,
    role,
    output_path="s3://{}/{}/output".format(bucket, prefix),
    train_instance_count=1,
    train_instance_type="ml.c4.2xlarge",
    sagemaker_session=session,
)

lda.set_hyperparameters(
    mini_batch_size=40,
    feature_dim=vocabulary_size,
)
# s3_input_train and s3_input_test hold doc-term matrices of the train/test corpora
s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_test = 's3://{}/{}/test/'.format(bucket, prefix)
data_channels = {'train': s3_input_train, 'test': s3_input_test}
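(For context, a minimal sketch of how doc-term matrices can be serialized to the RecordIO protobuf format that SageMaker's built-in LDA expects and uploaded under those prefixes; the upload_matrix helper and the data.pbr key name are illustrative, not from my actual code:)

import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

def upload_matrix(matrix, channel):
    # SageMaker's built-in LDA consumes RecordIO-wrapped protobuf records
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, np.asarray(matrix, dtype='float32'))
    buf.seek(0)
    key = '{}/{}/data.pbr'.format(prefix, channel)
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)

upload_matrix(doc_term_matrix_train, 'train')
upload_matrix(doc_term_matrix_test, 'test')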
hyperparameter_ranges = {
    "alpha0": ContinuousParameter(0.1, 1.5, scaling_type="Logarithmic"),
    "num_topics": IntegerParameter(3, 10),
}
# Configure the HyperparameterTuner
my_tuner = HyperparameterTuner(
    estimator=lda,
    objective_metric_name='test:pwll',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=5,
    max_parallel_jobs=2,
)

# Start the hyperparameter tuning job
my_tuner.fit(data_channels, job_name='run-3', include_cls_metadata=False)
When I run the above, the tuning jobs fail, and when I look at the logs in CloudWatch, the error is typically:
[01/19/2022 19:42:22 ERROR 140234465695552] Algorithm Error: index 11873 is out of bounds for axis 1 with size 11873 (caused by IndexError)
I call out the error above because 11873 is the number of features (columns) in my test dataset, so I think there's a connection, but I'm not sure exactly what's going on. When I instead hard-code 11873 as the value for feature_dim, the error complains that the data has 32465 features (corresponding to the training set). Summing the two values gives yet another error:
[01/20/2022 13:44:01 ERROR 140125082621760] Customer Error: The supplied feature_dim parameter does not have the same dimensionality of the data. (feature_dim) 44338 != 32465 (data).
Lastly, one of the final log lines in CloudWatch reports the following, suggesting that "all data" is being fit into a matrix with the dimensions of the test data:
[01/20/2022 14:49:52 INFO 140411440904000] Loaded all data into matrix with shape: (11, 11873)
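My rough mental model of the IndexError, as a standalone illustration (this is my own reconstruction, not SageMaker's actual internals): word IDs that are valid under the 32465-feature training vocabulary can exceed the bounds of an 11873-column test matrix:

import numpy as np

# Shape reported in the log: 11 test documents, 11873 vocabulary columns
test_matrix = np.zeros((11, 11873))

# A word ID that is valid under the 32465-feature training vocabulary...
word_id = 11873

# ...is out of bounds here, reproducing the logged message:
# IndexError: index 11873 is out of bounds for axis 1 with size 11873
test_matrix[0, word_id]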
How do I define feature_dim when the test and training datasets have different dimensionality?
I have resolved this issue. My problem was that I was splitting the data into test and train BEFORE converting it into doc-term matrices, which produced test and train datasets with different dimensionality (each split got its own vocabulary) and threw off SageMaker's algorithm. Once I converted all of the input data into a single doc-term matrix and THEN split it into test and train, the hyperparameter tuning job completed.
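In code, the fix looks roughly like this (a sketch using scikit-learn's CountVectorizer and train_test_split; my original vectorization code isn't shown above, so the documents variable and the split parameters are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Build ONE vocabulary over the full corpus so every row has the same width
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(documents).toarray()

# Split AFTER vectorizing; train and test now share the same columns
doc_term_matrix_train, doc_term_matrix_test = train_test_split(
    doc_term_matrix, test_size=0.2, random_state=42
)

# A single feature_dim is now valid for both channels
vocabulary_size = doc_term_matrix.shape[1]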