amazon-web-services xgboost amazon-sagemaker distributed-training amz-sagemaker-distributed-training

Add security groups in Amazon SageMaker for distributed training jobs


We would like to enforce specific security groups to be set on the SageMaker training jobs (XGBoost in script mode). However, distributed training, in this case, won’t work out of the box, since the containers need to communicate with each other. What are the minimum inbound/outbound rules (ports) that we need to specify for training jobs so that they can communicate?


Solution

  • Setting up training in a VPC, including specifying security groups, is documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html#train-vpc-groups

    Normally you would allow all communication between the training nodes. To do this, set both the source of the inbound rule and the destination of the outbound rule to the security group itself, and allow all IPv4 traffic. If you want to find out which ports are actually used, you could: 1/ define the permissive security group; 2/ turn on VPC Flow Logs; 3/ run training; 4/ examine the flow logs; 5/ restrict the security group to only the required ports.
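As a rough illustration of the self-referencing rule described above, here is a minimal sketch using boto3. The helper names (self_referencing_rule, allow_intra_group_traffic) and the security group ID are hypothetical and not from the original answer; the rule payload follows the IpPermissions shape accepted by the EC2 authorize_security_group_ingress and authorize_security_group_egress calls.

```python
def self_referencing_rule(group_id, protocol="-1", from_port=None, to_port=None):
    """Build an IpPermissions entry whose peer is the security group itself.

    protocol="-1" means all protocols and ports. Pass e.g. "tcp" with a port
    range once the required ports are known.
    """
    rule = {
        "IpProtocol": protocol,
        # Referencing the group's own ID makes members of the group peers.
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }
    if from_port is not None and to_port is not None:
        rule["FromPort"] = from_port
        rule["ToPort"] = to_port
    return rule


def allow_intra_group_traffic(group_id):
    """Apply the permissive rule in both directions (hypothetical helper)."""
    import boto3  # imported lazily so the rule builder works without boto3

    ec2 = boto3.client("ec2")
    permissions = [self_referencing_rule(group_id)]
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=permissions)
    ec2.authorize_security_group_egress(GroupId=group_id, IpPermissions=permissions)
```

Calling allow_intra_group_traffic("sg-0123456789abcdef0") (a placeholder ID) would open all traffic between instances that share that group, which corresponds to step 1/ of the workflow. The same builder can later produce the narrowed rule for step 5/ by passing a protocol and port range.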

    I must say that restricting communication between the training nodes might be extreme, so I would challenge the customer on why it is really needed: all nodes run the same job, have the same IAM role, and are transient by nature.
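If the ports do have to be narrowed, the "examine VPC Flow Logs" step from the workflow above could be approximated with a small script like the one below. This is only a sketch: it assumes the flow logs use the default (version 2) record format and have already been exported as plain text lines. The function name accepted_ports and the sample records (addresses, port 5000, interface ID) are made up for illustration and say nothing about which ports XGBoost actually uses.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def accepted_ports(lines):
    """Count accepted flows per (protocol number, destination port)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers or records in a custom format
        record = dict(zip(FIELDS, parts))
        if record["action"] != "ACCEPT":
            continue  # also skips NODATA/SKIPDATA records, whose action is "-"
        counts[(int(record["protocol"]), int(record["dstport"]))] += 1
    return counts


# Made-up sample records, for illustration only.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.1.6 49152 5000 6 10 840 1600000000 1600000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.7 10.0.1.6 49153 5000 6 12 900 1600000000 1600000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.9.9 10.0.1.6 40000 22 6 1 40 1600000000 1600000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1600000000 1600000060 - NODATA",
]

for (protocol, port), flows in sorted(accepted_ports(SAMPLE).items()):
    print(f"protocol={protocol} dstport={port} flows={flows}")
```

Because flow logs record both directions of a connection, real output will also contain high ephemeral destination ports that belong to response traffic. Security groups are stateful, so those responses do not need rules of their own; only the ports the nodes listen on matter when tightening the group.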