jupyter-notebook, cluster-analysis, dbscan

Is the clustering algorithm still running after a Jupyter Notebook "Gateway Timeout"?


I am running the sklearn DBSCAN algorithm on a dataset of 300,000 rows x 50 columns in a Jupyter Notebook on AWS SageMaker ("ml.t2.medium" compute instance). The dataset contains feature vectors of 1s and 0s.

Once I run the cell, an orange "Gateway Timeout" prompt appears in the upper right corner after a while. The icon disappears when you click on it, providing no further information. The notebook is unresponsive until you restart the notebook instance.

I have tried different values for the parameters eps and min_samples to no avail.

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.1, min_samples=100).fit(transformed_vectors)

Does "Gateway Timeout" mean that the notebook kernel has crashed, or can I still expect results if I wait?

So far the calculation has been running for about 2 hours.
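
One way to judge whether waiting is realistic is to time the same DBSCAN call on growing subsamples and watch how the run time scales. The sketch below is only an illustration, not a benchmark: it uses synthetic 0/1 data as a stand-in for `transformed_vectors`, and the helper name `time_dbscan` and the subsample sizes are hypothetical.

```python
import time

import numpy as np
from sklearn.cluster import DBSCAN


def time_dbscan(data, sizes, eps=0.1, min_samples=100, seed=0):
    """Time DBSCAN.fit on random subsamples of `data`; return (size, seconds) pairs."""
    rng = np.random.default_rng(seed)
    timings = []
    for n in sizes:
        # Draw n distinct rows so each run sees a random subset of the data.
        idx = rng.choice(len(data), size=n, replace=False)
        start = time.perf_counter()
        DBSCAN(eps=eps, min_samples=min_samples).fit(data[idx])
        timings.append((n, time.perf_counter() - start))
    return timings


if __name__ == "__main__":
    # Synthetic 0/1 feature vectors standing in for transformed_vectors (assumption).
    rng = np.random.default_rng(0)
    data = rng.integers(0, 2, size=(8000, 50)).astype(np.float64)
    for n, seconds in time_dbscan(data, sizes=[1000, 2000, 4000]):
        print(f"{n:>6} rows: {seconds:.2f} s")
```

If the time roughly quadruples whenever the subsample doubles, the full 300,000 rows are unlikely to finish in a reasonable time on a small instance.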



Solution

  • You could always pick a larger size for your notebook instance (ml.t2.medium is pretty small), but I think the better way would be to run your code on a managed SageMaker training instance. Scikit-learn is built into SageMaker, so all you have to do is bring your script, e.g.:

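    import sagemaker
    from sagemaker.sklearn.estimator import SKLearn

    sagemaker_session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # IAM role attached to the notebook instance

    # Note: SageMaker Python SDK v2 renames train_instance_type to instance_type
    # and also requires a framework_version argument.
    sklearn = SKLearn(
        entry_point="my_code.py",  # your script containing the DBSCAN call
        train_instance_type="ml.c4.xlarge",
        role=role,
        sagemaker_session=sagemaker_session)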
    

    Here's a complete example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb
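
For illustration, here is a minimal sketch of what the `my_code.py` entry point could look like for the DBSCAN case. It assumes the input vectors were uploaded as a NumPy file; the file names `vectors.npy` and `labels.npy` and the helper `run_dbscan` are hypothetical, while `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` are environment variables that SageMaker sets inside the training container.

```python
import argparse
import os

import numpy as np
from sklearn.cluster import DBSCAN


def run_dbscan(train_dir, model_dir, eps, min_samples):
    """Load vectors.npy from train_dir, cluster it, and save the labels to model_dir."""
    vectors = np.load(os.path.join(train_dir, "vectors.npy"))  # hypothetical file name
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(vectors)
    os.makedirs(model_dir, exist_ok=True)
    out_path = os.path.join(model_dir, "labels.npy")  # hypothetical file name
    # Files written to the model directory are uploaded to S3 when the job ends.
    np.save(out_path, db.labels_)
    return out_path


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters set on the estimator arrive as command-line arguments.
    parser.add_argument("--eps", type=float, default=0.1)
    parser.add_argument("--min_samples", type=int, default=100)
    # Input and output locations default to the SageMaker environment variables.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    args = parser.parse_args(argv)
    return run_dbscan(args.train, args.model_dir, args.eps, args.min_samples)


# In the actual script, finish with:
# if __name__ == "__main__":
#     main()
```

With a script like this, calling `sklearn.fit({"train": s3_input_path})` on the estimator above (where `s3_input_path` is a placeholder for your own S3 prefix) would start the job; the `train` key is what makes SageMaker populate `SM_CHANNEL_TRAIN`. The job then runs independently of the notebook, so a browser or gateway timeout does not interrupt it.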