I am running the sklearn DBSCAN algorithm on a dataset of shape 300000x50 in a Jupyter Notebook on AWS SageMaker ("ml.t2.medium" compute instance). The dataset contains binary feature vectors (1s and 0s).
Once I run the cell, an orange "Gateway Timeout" prompt appears in the upper right corner after a while. The icon disappears when I click on it, providing no further information. The notebook stays unresponsive until I restart the notebook instance.
I have tried different values for the parameters eps and min_samples to no avail.
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.1, min_samples=100).fit(transformed_vectors)
Does "Gateway Timeout" mean that the notebook kernel has crashed, or can I still expect results if I wait?
So far the calculation has been running for about 2 hours.
You could always pick a larger notebook instance (ml.t2.medium is pretty small), but I think the better way would be to run your code on a managed SageMaker training instance. Scikit-learn is built into SageMaker, so all you have to do is bring your script, e.g.:
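from sagemaker.sklearn.estimator import SKLearn

# role and sagemaker_session are assumed to be defined earlier in the notebook.
# In SageMaker Python SDK v2, train_instance_type is renamed to instance_type
# and framework_version must also be passed.
sklearn = SKLearn(
    entry_point="my_code.py",            # your own script
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session)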
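To make "bring your script" a bit more concrete, here is a rough sketch of what my_code.py might look like for the DBSCAN case. The file names vectors.npy and labels.npy, the helper name cluster, and the hyperparameter names are my own placeholders, and it assumes the data was uploaded as a channel called "train"; treat it as a starting point rather than the canonical layout.

```python
# my_code.py -- hypothetical entry point; file names and defaults are assumptions
import argparse
import os

import numpy as np
from sklearn.cluster import DBSCAN


def cluster(vectors, eps, min_samples):
    """Fit DBSCAN and return one label per row (-1 marks noise)."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit(vectors).labels_


def main():
    parser = argparse.ArgumentParser()
    # Hyperparameters passed to the estimator arrive as command-line arguments.
    parser.add_argument("--eps", type=float, default=0.1)
    parser.add_argument("--min-samples", type=int, default=100)
    args = parser.parse_args()

    # SageMaker sets these environment variables inside the training container.
    train_dir = os.environ["SM_CHANNEL_TRAIN"]
    model_dir = os.environ.get("SM_MODEL_DIR", ".")

    vectors = np.load(os.path.join(train_dir, "vectors.npy"))  # placeholder file name
    labels = cluster(vectors, args.eps, args.min_samples)

    # Files written to the model directory are uploaded to S3 when the job ends.
    np.save(os.path.join(model_dir, "labels.npy"), labels)


if __name__ == "__main__":
    # SM_CHANNEL_TRAIN is only set inside a training job with a "train" channel,
    # so nothing runs when the file is executed elsewhere.
    if "SM_CHANNEL_TRAIN" in os.environ:
        main()
```

The job would then be started with something like sklearn.fit({"train": s3_input_path}), where s3_input_path is whatever S3 prefix holds your data, and the notebook only has to wait for the job instead of doing the clustering itself.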
Here's a complete example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb
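Separately from the instance size, it may be worth double-checking eps=0.1 from the question. Assuming the default metric (Euclidean) is used, two binary vectors that differ in k positions are sqrt(k) apart, so any two distinct vectors are at least 1.0 apart and an eps of 0.1 can only ever group exact duplicates. A small plain-Python illustration (the vectors are made up):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1, 0, 1, 1]
v = [1, 0, 1, 0]  # differs from u in one position
w = [0, 1, 1, 0]  # differs from u in three positions

eps = 0.1
d_uv = euclidean(u, v)  # sqrt(1)
d_uw = euclidean(u, w)  # sqrt(3)

# Only an identical vector falls inside the eps-neighbourhood of u.
neighbours_of_u = [x for x in (u, v, w) if euclidean(u, x) <= eps]
print(d_uv, d_uw, len(neighbours_of_u))
```

If near-duplicates should end up in the same cluster, a larger eps or a different metric (for example metric="hamming") might be closer to what you want, but that depends on your data.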