
Python GRPC - Failed to pick subchannel


I'm trying to set up a gRPC client in Python to hit a particular server. The server requires authentication via an access token, so my implementation looks like this:

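from grpc import (access_token_call_credentials, composite_channel_credentials,
                  secure_channel, ssl_channel_credentials)

def create_connection(target, access_token):
    # Combine TLS channel credentials with per-call bearer-token credentials.
    credentials = composite_channel_credentials(
        ssl_channel_credentials(),
        access_token_call_credentials(access_token))

    target = target if target else DEFAULT_ENDPOINT
    return secure_channel(target=target, credentials=credentials)

# access_token is assumed to come from a Session(client_id=id, client_secret=secret) login.
conn = create_connection(target="myservice", access_token=access_token)
stub = FakeStub(conn)
stub.CreateObject(CreateObjectRequest())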

The issue I'm having is that, when I attempt to use this connection I get the following error:

File "<stdin>", line 1, in <module>
File "\anaconda3\envs\test\lib\site-packages\grpc\_interceptor.py", line 216, in __call__
    response, ignored_call = self._with_call(request,
File "\anaconda3\envs\test\lib\site-packages\grpc\_interceptor.py", line 257, in _with_call
    return call.result(), call
File "anaconda3\envs\test\lib\site-packages\grpc\_channel.py", line 343, in result
    raise self
File "\anaconda3\envs\test\lib\site-packages\grpc\_interceptor.py", line 241, in continuation
    response, call = self._thunk(new_method).with_call(
File "\anaconda3\envs\test\lib\site-packages\grpc\_interceptor.py", line 266, in with_call
    return self._with_call(request,
File "\anaconda3\envs\test\lib\site-packages\grpc\_interceptor.py", line 257, in _with_call
    return call.result(), call
File "\anaconda3\envs\test\lib\site-packages\grpc\_channel.py", line 343, in result
    raise self
File "\anaconda3\envs\test\lib\site-packages\grpc\_interceptor.py", line 241, in continuation
    response, call = self._thunk(new_method).with_call(
File "\anaconda3\envs\test\lib\site-packages\grpc\_channel.py", line 957, in with_call
    return _end_unary_response_blocking(state, call, True, None)
File "\anaconda3\envs\test\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{
        "created":"@1633399048.828000000",
        "description":"Failed to pick subchannel",
        "file":"src/core/ext/filters/client_channel/client_channel.cc",
        "file_line":3159,
        "referenced_errors":[
            {
                "created":"@1633399048.828000000",
                "description":
                "failed to connect to all addresses",
                "file":"src/core/lib/transport/error_utils.cc",
                "file_line":147,
                "grpc_status":14
            }
        ]
    }"

I looked up the status code associated with this response and it seems that the server is unavailable. So, I tried waiting for the connection to be ready:

channel_ready_future(conn).result()

but this hangs. What am I doing wrong here?

UPDATE 1

I converted the code to use the async connection instead of the synchronous one, but the issue persists. I also saw that this question had been posted on Stack Overflow, but none of the solutions presented there fixed the problem I'm having.

UPDATE 2

I assumed that this issue was occurring because the client couldn't find the TLS certificate issued by the server, so I added the following code:

import ssl

def _get_cert(target: str) -> bytes:
    # Fetch the server's PEM certificate; target is expected as "host:port".
    host, port = target.rsplit(":", 1)
    data = ssl.get_server_certificate((host, int(port)))
    return data.encode()

and then changed ssl_channel_credentials() to ssl_channel_credentials(_get_cert(target)). However, this also hasn't fixed the problem.


Solution

    The issue here was actually fairly deep. First, I turned on tracing, set the gRPC log level to debug, and then found this line:

    D1006 12:01:33.694000000 9032 src/core/lib/security/transport/security_handshaker.cc:182] Security handshake failed: {"created":"@1633489293.693000000","description":"Cannot check peer: missing selected ALPN property.","file":"src/core/lib/security/security_connector/ssl_utils.cc","file_line":160}
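The post does not say how tracing was switched on. One common way, sketched here as an assumption, is to set the gRPC core environment variables GRPC_VERBOSITY and GRPC_TRACE before the grpc module is imported. The helper name enable_grpc_debug_logging is hypothetical, and the exact tracer names that are accepted can vary between grpcio versions.

```python
import os


def enable_grpc_debug_logging(trace: str = "all") -> dict:
    """Set gRPC core logging variables (hypothetical helper).

    Call this before `import grpc`, because the core library reads
    these variables when it initializes.
    """
    os.environ["GRPC_VERBOSITY"] = "DEBUG"  # log level for gRPC core
    os.environ["GRPC_TRACE"] = trace        # comma-separated tracer names, or "all"
    return {name: os.environ[name] for name in ("GRPC_VERBOSITY", "GRPC_TRACE")}


settings = enable_grpc_debug_logging()
# import grpc  # import only after the variables above are set
```

With "all" the output is very noisy, so it may be worth narrowing the tracer list once the failing stage (here, the security handshake) is known.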

    That handshake error led me to a GitHub issue, which stated that the problem was grpcio not inserting the h2 protocol into requests, which would cause ALPN-enabled servers to return that specific error. Some further digging led me to a second issue, and since the server I connected to also uses Envoy, it was just a matter of modifying the Envoy deployment file so that:

    clusters:
      - name: my-server
        connect_timeout: 10s
        type: strict_dns
        lb_policy: round_robin
        http2_protocol_options: {}
        hosts:
        - socket_address:
            address: python-server
            port_value: 1337
        tls_context:
          common_tls_context:
            tls_certificates:
            alpn_protocols: ["h2"]  # <== Add this.
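One way to confirm that the endpoint now selects h2 during the TLS handshake, without going through gRPC at all, is to offer h2 via ALPN with Python's standard ssl module and inspect what the server picks. This is only a sketch: split_target and negotiated_alpn are hypothetical helper names, the target string is a placeholder, and the default context verifies the certificate against the system trust store, so a self-signed server certificate would need a context loaded with the right CA.

```python
import socket
import ssl
from typing import Optional, Tuple


def split_target(target: str) -> Tuple[str, int]:
    """Split a "host:port" string into (host, port)."""
    host, port = target.rsplit(":", 1)
    return host, int(port)


def negotiated_alpn(target: str, timeout: float = 5.0) -> Optional[str]:
    """Return the ALPN protocol the server selects, or None if it selects none."""
    host, port = split_target(target)
    context = ssl.create_default_context()
    context.set_alpn_protocols(["h2"])  # offer HTTP/2, which gRPC requires
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling negotiated_alpn("host:port") against the TLS listener should return "h2" once ALPN is configured. A result of None would be consistent with the "missing selected ALPN property" handshake failure above.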