
Scheduler-worker cluster without port forwarding


Hello Stack Overflow!

TL;DR: I would like to recreate https://github.com/KorayGocmen/scheduler-worker-grpc without port forwarding on the worker.

I am trying to build a competitive programming judge server for evaluating submissions, as a project for my school, where I teach programming to kids.

Because the evaluation is computationally heavy, I would like to have multiple worker nodes. The scheduler would receive submissions and hand them out to the worker nodes. For ease of worker deployment (as workers will change often), I would like a worker to be able to subscribe to the scheduler, thereby registering itself and starting to receive jobs.

The workers may not be on the same network as the scheduler, and each worker runs inside a VM (it may later be ported to Docker, but there are currently issues with that).

The scheduler should be able to monitor the worker's resource usage, send different types of jobs to the worker, and receive a stream of results.

I am currently thinking of using gRPC for the communication between the workers and the scheduler.

I could create multiple scheduler service methods like:

  1. register worker, receive a stream of jobs
  2. stream job results, receive nothing
  3. stream worker state periodically, receive nothing

However, I would prefer the following, though I don't know whether it is possible:

  • The scheduler gRPC API:
    • register a worker (making the worker gRPC API available to the scheduler)
  • The worker gRPC API:
    • start a job (returns a stream of job status)
    • cancel a job ???
    • get resource usage

The worker should unregister automatically if the connection is lost.

So my question is: is it possible to create a gRPC worker API that can be registered with the scheduler for later use if the worker is behind NAT, without port forwarding?

Additional possibly unnecessary information:

Making matters worse, I have multiple radically different types of jobs (streaming an interactive console, executing code against prepared test cases). I may just create different workers for different job types.

Some jobs involve large files on the local filesystem (up to 500 MB) that are usually kept near the scheduler, so I would like to send such a job to a worker that has already downloaded the specific files from the scheduler, and otherwise download the files to one of the workers. Keeping all files on a worker at the same time would take more than 20 GB, so I would like to avoid that.

A worker can run multiple jobs (up to 16) at the same time.

I am writing the system in Go.


Solution

  • As long as only the workers initiate the connections, you don't have to worry about NAT. gRPC supports streaming in either direction (or both). This means that all of your requirements can be implemented using just one server on the scheduler; there is no need for the scheduler to connect back to the workers.

    Given your description your service could look something like this:

    syntax = "proto3";
    
    import "google/protobuf/empty.proto";
    
    service Scheduler {
        rpc GetJobs(GetJobsRequest) returns (stream GetJobsResponse) {}
        rpc ReportWorkerStatus(stream ReportWorkerStatusRequest) returns (google.protobuf.Empty) {}
        rpc ReportJobStatus(stream JobStatus) returns (stream JobAction) {}
    }
    
    enum JobType {
        JOB_TYPE_UNSPECIFIED = 0;
        JOB_TYPE_CONSOLE = 1;
        JOB_TYPE_EXEC = 2;
    }
    
    message GetJobsRequest {
        // List of job types this worker is willing to accept.
        repeated JobType types = 1;
    }
    
    message GetJobsResponse {
        string jobId = 1;
        JobType type = 2;
    
        string fileName = 3;
        bytes fileContent = 4;
        // etc.
    }
    
    message ReportWorkerStatusRequest {
        float cpuLoad = 1;
        uint64 availableDiskSpace = 2;
        uint64 availableMemory = 3;
        // etc.
    
        // List of filenames or file hashes, or whatever else you need to precisely
        // report the presence of files.
        repeated string haveFiles = 4;
    }
    
    // Placeholders for the bidirectional job-status stream; fill in whatever
    // status and control fields you need.
    message JobStatus {
        string jobId = 1;
        // etc.
    }
    
    message JobAction {
        string jobId = 1;
        // etc.
    }
    
    

    Much of this is a matter of preference (you can use oneof instead of enums, for instance), but hopefully it's clear that a single connection from client to server is sufficient for your requirements.
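
    Purely as an illustration of the oneof variant, the job payload could instead be modeled like this; ConsoleJob, ExecJob and their fields are placeholders, not part of the service above:

    message GetJobsResponse {
        string jobId = 1;
    
        // Exactly one of these is set, depending on the kind of job.
        oneof job {
            ConsoleJob console = 2;
            ExecJob exec = 3;
        }
    }
    
    message ConsoleJob {
        // Whatever the interactive-console job needs.
    }
    
    message ExecJob {
        string fileName = 1;
        bytes fileContent = 2;
    }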

    Maintaining the set of available workers is quite simple:

    func (s *Server) GetJobs(req *pb.GetJobsRequest, stream pb.Scheduler_GetJobsServer) error {
        ctx := stream.Context()
    
        s.scheduler.AddWorker(req)
        defer s.scheduler.RemoveWorker(req)
    
        for {
            job, err := s.scheduler.GetJob(ctx, req)
            switch {
            case ctx.Err() != nil: // client disconnected
                return nil
            case err != nil:
                return err
            }
    
            if err := stream.Send(job); err != nil {
                return err
            }
        }
    }
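
    The scheduler value used by the handler (AddWorker, RemoveWorker, GetJob) is not shown above; here is a minimal sketch of one possible in-memory version, assuming a channel of pending jobs per connected worker and the usual context/sync/pb imports:

    // Sketch only: an in-memory registry backing the GetJobs handler above.
    // Jobs for a worker are queued on a per-worker channel; the handler's
    // request value doubles as the worker's key.
    type scheduler struct {
        mu      sync.Mutex
        workers map[*pb.GetJobsRequest]chan *pb.GetJobsResponse
    }
    
    func newScheduler() *scheduler {
        return &scheduler{workers: make(map[*pb.GetJobsRequest]chan *pb.GetJobsResponse)}
    }
    
    func (s *scheduler) AddWorker(req *pb.GetJobsRequest) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.workers[req] = make(chan *pb.GetJobsResponse, 16)
    }
    
    func (s *scheduler) RemoveWorker(req *pb.GetJobsRequest) {
        s.mu.Lock()
        defer s.mu.Unlock()
        delete(s.workers, req)
    }
    
    // GetJob blocks until a job is queued for this worker or the stream's
    // context is cancelled (i.e. the worker disconnected).
    func (s *scheduler) GetJob(ctx context.Context, req *pb.GetJobsRequest) (*pb.GetJobsResponse, error) {
        s.mu.Lock()
        jobs := s.workers[req]
        s.mu.Unlock()
    
        select {
        case job := <-jobs:
            return job, nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    Dispatching a job then amounts to picking a suitable worker (based on ReportWorkerStatus data, file locality, load, etc.) and sending on its channel.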
    

    The gRPC Basics tutorial includes examples of all types of streaming, including server and client implementations in Go.
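
    On the worker side, consuming the job stream is an ordinary server-streaming client call. A rough sketch, assuming TLS transport credentials and a generated pb package; the address and runJob are placeholders:

    // Sketch of the worker side: dial the scheduler and consume the job stream.
    func runWorker(ctx context.Context) error {
        conn, err := grpc.Dial(
            "scheduler.example.com:443", // placeholder address
            grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
        )
        if err != nil {
            return err
        }
        defer conn.Close()
    
        client := pb.NewSchedulerClient(conn)
    
        stream, err := client.GetJobs(ctx, &pb.GetJobsRequest{
            Types: []pb.JobType{pb.JobType_JOB_TYPE_EXEC},
        })
        if err != nil {
            return err
        }
    
        for {
            job, err := stream.Recv()
            if err == io.EOF {
                return nil // scheduler closed the stream
            }
            if err != nil {
                return err // includes disconnects; reconnect with backoff as needed
            }
            go runJob(job) // hypothetical; run concurrently, bounded to e.g. 16 jobs
        }
    }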

    As for registration, that usually just means creating some sort of credential that a worker will use when communicating with the server. This might be a randomly generated token (which the server can use to load associated metadata), or a username/password combination, or a TLS client certificate, or similar. Details will depend on your infrastructure and desired workflow when setting up workers.
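
    For example, if registration is just a pre-shared token, the worker can attach it to every RPC as gRPC metadata and the scheduler can check it in the handler (or in an interceptor shared by all methods). The "worker-token" header name and the validToken helper below are assumptions, not an established convention:

    // Worker side: attach the token to outgoing RPCs as metadata.
    func withWorkerToken(ctx context.Context, token string) context.Context {
        return metadata.AppendToOutgoingContext(ctx, "worker-token", token)
    }
    
    // Scheduler side: reject streams from unknown workers.
    func (s *Server) GetJobs(req *pb.GetJobsRequest, stream pb.Scheduler_GetJobsServer) error {
        md, _ := metadata.FromIncomingContext(stream.Context())
        if !s.validToken(md.Get("worker-token")) { // validToken is hypothetical
            return status.Error(codes.Unauthenticated, "unknown worker")
        }
        // ... continue as in the handler above
        return nil
    }

    The worker would then pass withWorkerToken(ctx, token) as the context for GetJobs, ReportWorkerStatus and ReportJobStatus.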