Tags: python, amazon-web-services, performance, amazon-sagemaker, lda

SageMaker Hyperparameter Tuning for LDA, clarifying feature_dim


I'm trying to run a HyperparameterTuner on an Estimator for an LDA model in a SageMaker notebook (using the MXNet-based built-in algorithm), but I'm running into errors related to the feature_dim hyperparameter. I believe the cause is the differing dimensionality of the train and test datasets, but I'm not 100% certain of that, or of how to fix it.

Estimator Code

[note that I'm setting the feature_dim to the training dataset's dimensions]

import sagemaker

# feature_dim is taken from the training doc-term matrix's vocabulary size
vocabulary_size = doc_term_matrix_train.shape[1]

lda = sagemaker.estimator.Estimator(
        container,
        role,
        output_path="s3://{}/{}/output".format(bucket, prefix),
        train_instance_count=1,
        train_instance_type="ml.c4.2xlarge",
        sagemaker_session=session
        )

lda.set_hyperparameters(
    mini_batch_size=40,
    feature_dim=vocabulary_size,
    )
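container, role, bucket, prefix, and session are defined earlier in my notebook; with the v1-era SDK implied by the train_instance_* arguments, the setup is roughly along these lines (a sketch, not my exact code):

import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()                          # notebook's IAM role
container = get_image_uri(boto3.Session().region_name, 'lda')  # built-in LDA image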

Hyperparameter Tuning Job

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# s3_input_train and s3_input_test point to the doc-term matrices of the
# train/test corpus in S3
s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_test = 's3://{}/{}/test/'.format(bucket, prefix)
data_channels = {'train': s3_input_train, 'test': s3_input_test}

hyperparameter_ranges = {
    "alpha0": ContinuousParameter(0.1, 1.5, scaling_type="Logarithmic"),
    "num_topics": IntegerParameter(3, 10)}

# Configure HyperparameterTuner
my_tuner = HyperparameterTuner(estimator=lda,
                               objective_metric_name='test:pwll',
                               hyperparameter_ranges=hyperparameter_ranges,
                               max_jobs=5,
                               max_parallel_jobs=2)

# Start hyperparameter tuning job
my_tuner.fit(data_channels, job_name='run-3', include_cls_metadata=False)
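For context: the built-in LDA algorithm expects each channel to contain RecordIO-wrapped protobuf data. I upload the matrices along these lines (sketched from memory, assuming doc_term_matrix_train is a SciPy sparse matrix from a vectorizer; the test channel is handled the same way):

import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Serialize the doc-term matrix to RecordIO protobuf and upload it to the
# train channel's S3 prefix
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(
    buf, np.asarray(doc_term_matrix_train.todense(), dtype=np.float32))
buf.seek(0)
boto3.resource('s3').Bucket(bucket).Object(
    '{}/train/train.protobuf'.format(prefix)).upload_fileobj(buf)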

CloudWatch Logs

When I run the above, the tuning jobs fail, and when I look at the logs in CloudWatch, the error is typically:

[01/19/2022 19:42:22 ERROR 140234465695552] Algorithm Error: index 11873 is out of bounds for axis 1 with size 11873 (caused by IndexError)

I can reproduce the above consistently; 11873 is the number of features in my test dataset, so I think there's a connection, but I'm not sure exactly what's going on. When I try 11873 as the value for feature_dim, the error instead complains that the data has 32465 features (corresponding to the training set). Summing the two values gives the following error:

[01/20/2022 13:44:01 ERROR 140125082621760] Customer Error: The supplied feature_dim parameter does not have the same dimensionality of the data. (feature_dim) 44338 != 32465 (data).

Lastly, one of the final CloudWatch log lines reports the following, suggesting that "all data" is being loaded into a matrix with the dimensions of the test data:

[01/20/2022 14:49:52 INFO 140411440904000] Loaded all data into matrix with shape: (11, 11873)

How do I define feature_dim given that the training and test datasets have different dimensionality?


Solution

  • I have resolved this issue. My problem was that I was splitting the data into test and train BEFORE converting it into doc-term matrices, which produced test and train datasets with different dimensionality and threw off SageMaker's algorithm. Once I converted ALL of the input data into a single doc-term matrix and THEN split it into test and train, the hyperparameter optimization job completed (sketched below).
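A minimal sketch of the corrected order of operations, assuming scikit-learn's CountVectorizer and train_test_split and a docs list of raw documents (my actual preprocessing differs in the details):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Vectorize the FULL corpus first so train and test share one vocabulary...
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(docs)

# ...then split by rows, so both splits have the same number of columns
doc_term_matrix_train, doc_term_matrix_test = train_test_split(
    doc_term_matrix, test_size=0.2, random_state=42)

vocabulary_size = doc_term_matrix_train.shape[1]
assert doc_term_matrix_test.shape[1] == vocabulary_size  # one unambiguous feature_dim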