Tags: docker, dns, amazon-ecs, dagster, aws-cloudmap

Error accessing a Dagster usercode server on ECS?


I have a problem with the usercode server: the webserver cannot reach it, and even though I have systematically permuted every option I can think of across the relevant variables, I cannot get it to work.

The code evolved out of the original deploy_ecs example, which I had to adapt after Docker dropped its ECS support. I created a Terraform setup for the ECS structure and the ECR registry and adapted the docker-compose file to the new structure. The usercode server is running, but the webserver cannot reach it; the error is a DNS resolution failure. The usercode server listens on port 4000, and the webserver tries to reach it at dagster-usercode.dagster.local:4000. That is the Cloud Map service name and should resolve to the IP address of the usercode server, but the resolution never happens.

Distributed installation in ECS:

  • 1 Webserver
  • 1 Daemon Server
  • 1 Usercode Server

workspace.yaml on webserver:

load_from:
  - python_file: job2.py
  - python_package:
      location_name: "webserver-jobs"
      package_name: job3
  - grpc_server:
      host: dagster-usercode.dagster.local
      port: 4000
      location_name: "usercode-jobs"

Cloud Map in AWS:

  • Namespace dagster.local
  • Cloud Map Service dagster-usercode, pointing to ECS Service instance dagster_usercode
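For debugging, one thing that can help is querying Cloud Map directly through its API, which bypasses DNS entirely and shows whether the service has any registered instances at all. A minimal sketch, assuming boto3 with configured AWS credentials and region, using the names from the list above:

import boto3

# Sketch: ask Cloud Map (not DNS) what it has registered for the service.
sd = boto3.client("servicediscovery")
resp = sd.discover_instances(
    NamespaceName="dagster.local",
    ServiceName="dagster-usercode",
)
for inst in resp["Instances"]:
    # AWS_INSTANCE_IPV4 is the attribute the A record is built from
    print(inst["InstanceId"], inst["Attributes"].get("AWS_INSTANCE_IPV4"))

If this returns instances but the name still does not resolve, the problem is on the VPC/DNS side rather than in the registration itself.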

ECS in AWS:

  • Service dagster_usercode runs the task that references the usercode server image in ECR; according to the logs the image is found and the container is running.
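Also worth confirming: an ECS service only registers itself in Cloud Map if it was created with a service registry attached. A sketch of how to check, with a hypothetical cluster name (substitute your own):

import boto3

ecs = boto3.client("ecs")
svc = ecs.describe_services(
    cluster="dagster-cluster",          # hypothetical name, use your own
    services=["dagster_usercode"],
)["services"][0]
# Should contain the ARN of the dagster-usercode Cloud Map service;
# an empty list means service discovery was never wired up in Terraform.
print(svc.get("serviceRegistries", []))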

The start command on the usercode server is:

dagster api grpc -h 0.0.0.0 -p 4000 -f sample_jobs.py
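As a side note, the server itself can be verified independently of DNS: Dagster ships a gRPC health-check command that can run inside the usercode container, for example as an ECS container health check. Something like:

dagster api grpc-health-check -p 4000

This exits non-zero if the server on that port is not responding, which separates "server is down" from "server is unreachable over the network".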

When loading the workspace.yaml on the webserver, I see these effects:

  • job2 is loaded and can be called
  • job3: Error (autodiscovery finds nothing at the package's top level; see the sketch after this list):

dagster._core.errors.DagsterInvariantViolationError: No repositories, jobs, pipelines, graphs, or asset definitions found in "job3".

  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/server.py", line 408, in __init__
    self._loaded_repositories: Optional[LoadedRepositories] = LoadedRepositories(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/server.py", line 242, in __init__
    loadable_targets = get_loadable_targets(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/utils.py", line 60, in get_loadable_targets
    else loadable_targets_from_python_package(package_name, working_directory)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/workspace/autodiscovery.py", line 51, in loadable_targets_from_python_package
    return loadable_targets_from_loaded_module(module)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/workspace/autodiscovery.py", line 116, in loadable_targets_from_loaded_module
    raise DagsterInvariantViolationError(
  • grpc: Error (DNS resolution for the Cloud Map name fails; a resolver check is sketched further below):

dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/workspace/context.py", line 614, in _load_location
    else origin.create_location(self.instance)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/host_representation/origin.py", line 373, in create_location
    return GrpcServerCodeLocation(self, instance=instance)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/host_representation/code_location.py", line 632, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.10/site-packages/dagster/_api/list_repositories.py", line 20, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 250, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 173, in _query
    self._raise_grpc_exception(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 156, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "DNS resolution failed for dagster-usercode.dagster.local:4000: C-ares status is not ARES_SUCCESS qtype=A name=dagster-usercode.dagster.local is_balancer=0: Domain name not found"
  debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-02-12T16:21:15.104749061+00:00", grpc_status:14, grpc_message:"DNS resolution failed for dagster-usercode.dagster.local:4000: C-ares status is not ARES_SUCCESS qtype=A name=dagster-usercode.dagster.local is_balancer=0: Domain name not found"}">

  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 171, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 141, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1160, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1003, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable

The above exception occurred during handling of the following exception:

dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/server_watcher.py", line 119, in watch_grpc_server_thread
    watch_for_changes()
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/server_watcher.py", line 82, in watch_for_changes
    new_server_id = client.get_server_id(timeout=REQUEST_TIMEOUT)
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 233, in get_server_id
    res = self._query("GetServerId", api_pb2.Empty, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 173, in _query
    self._raise_grpc_exception(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 156, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "DNS resolution failed for dagster-usercode.dagster.local:4000: C-ares status is not ARES_SUCCESS qtype=AAAA name=dagster-usercode.dagster.local is_balancer=0: Domain name not found"
  debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-02-12T16:21:04.309598593+00:00", grpc_status:14, grpc_message:"DNS resolution failed for dagster-usercode.dagster.local:4000: C-ares status is not ARES_SUCCESS qtype=AAAA name=dagster-usercode.dagster.local is_balancer=0: Domain name not found"}">

  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 171, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 141, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1160, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1003, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
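On the job3 error: loadable_targets_from_loaded_module raises exactly this when the package's top-level module exposes nothing Dagster can discover. A minimal sketch of what job3/__init__.py would need to export (the names here are illustrative, not taken from the actual project):

from dagster import Definitions, job, op

@op
def say_hello() -> str:
    return "hello"

@job
def hello_job():
    say_hello()

# Autodiscovery scans the module's top level; a Definitions object
# (or a module-level @job / @repository) is what it looks for.
defs = Definitions(jobs=[hello_job])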

What am I missing? From the ECS side everything looks like a go. The job from the package is not found either, but that is not the main issue; my major concern is the usercode server, and I am at a loss for next steps (one diagnostic idea is sketched below).
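One way to narrow it down: the failure above happens before any Dagster logic runs, so a plain resolver check from inside the webserver container (e.g. via ECS Exec) shows whether the name is resolvable at all, independent of gRPC. A minimal sketch:

import socket

# Raises socket.gaierror if the VPC resolver cannot answer for the
# Cloud Map name -- the same failure the gRPC channel reports above.
print(socket.getaddrinfo("dagster-usercode.dagster.local", 4000, type=socket.SOCK_STREAM))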

Any help is highly appreciated.


Solution

  • The problem was in fact the AWS VPC setup: AWS Cloud Map namespaces are not compatible with self-managed DNS. Once the VPC configuration was correct, all machines had to be rebuilt, because resolv.conf had been polluted by the wrong setup.
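For anyone hitting the same wall: Cloud Map private DNS namespaces resolve through the VPC's built-in resolver, which requires both VPC DNS attributes to be enabled. A quick check, as a sketch (the VPC id is a placeholder):

import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"  # placeholder, use your own VPC id
for attr, key in (("enableDnsSupport", "EnableDnsSupport"),
                  ("enableDnsHostnames", "EnableDnsHostnames")):
    value = ec2.describe_vpc_attribute(VpcId=vpc_id, Attribute=attr)[key]["Value"]
    # Both must be True for the dagster.local names to resolve in-VPC
    print(attr, value)

A DHCP options set that points the VPC at self-managed DNS servers overrides that built-in resolver, which matches the resolv.conf symptom described above.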