I'm getting the exception below when using read_rows on a Bigtable table. The table has one row per document feature: the row_key is the feature, and the columns are the document ids that have that feature. Each document has 300 to 800 features and there are about 2 million documents, so the table has billions of rows.
I'm running this on a 16-CPU VM on GCP with load averages between 6 and 10, using the Python Bigtable SDK (google-cloud-bigtable 2.3.3) on Python 3.6.8.
The exception happens while reading rows with table.read_rows(start_key=foo#xy, end_key=foo#xz), where foo#xy and foo#xz come from table.sample_row_keys(). I get 200 prefixes from sample_row_keys() and successfully process the first 5 or so before hitting this error. The table.read_rows() calls run in a ThreadPool.
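Roughly, the read loop looks like this (a simplified sketch, not the exact script in the traceback; the project/instance/table names, pool size, and the work inside do_prefix are placeholders):

# Simplified sketch of the read pattern (names and pool size are placeholders).
from multiprocessing.pool import ThreadPool
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("docsim").table("docsim")

# sample_row_keys() gives ~200 sample keys; consecutive pairs become ranges.
# (Real code also has to cover the ranges before the first and after the
# last sample key.)
keys = [s.row_key for s in table.sample_row_keys()]
jobs = list(zip(keys, keys[1:]))

def do_prefix(job):
    start, end = job
    n = 0
    for row in table.read_rows(start_key=start, end_key=end):
        n += 1  # the real code aggregates per-feature document counts here
    return start, n

with ThreadPool(16) as pool:
    for start, n in pool.imap_unordered(do_prefix, jobs):
        print(start, n)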
If you've encountered an exception like this and investigated it, what was the cause and what did you do to prevent it?
Traceback (most recent call last):
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/grpc_helpers.py", line 106, in __next__
return next(self._wrapped)
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/grpc/_channel.py", line 809, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.ABORTED
details = "Error while reading table 'projects/arxiv-production/instances/docsim/tables/docsim' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)"
debug_error_string = "{"created":"@1635477504.521060666","description":"Error received from peer ipv4:172.217.0.42:443","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Error while reading table 'projects/arxiv-production/instances/docsim/tables/docsim' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)","grpc_status":10}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/bdc34/docsim/docsim/loading/all_common_hashes.py", line 53, in <module>
for hash, n, c, dt in pool.imap_unordered( do_prefix, jobs ):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
raise value
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/bdc34/docsim/docsim/loading/all_common_hashes.py", line 33, in do_prefix
for hash, common, papers in by_prefix(db, start, end):
File "/home/bdc34/docsim/docsim/loading/all_common_hashes.py", line 15, in by_prefix
for row in db.table.read_rows(start_key=start, end_key=end):
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 485, in __iter__
response = self._read_next_response()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 474, in _read_next_response
return self.retry(self._read_next, on_error=self._on_error)()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
on_error=on_error,
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/retry.py", line 190, in retry_target
return target()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 470, in _read_next
return six.next(self.response_iterator)
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/grpc_helpers.py", line 109, in __next__
raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.Aborted: 409 Error while reading table 'projects/testproject/instances/testinstance/tables/testtable' :
Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)
I worked around this by calling read_rows with much smaller ranges. The prefixes from table.sample_row_keys() were spanning around 1.5B rows; bisecting each range 5 times to produce smaller ranges worked.
I bisected by padding the start and end row keys to the same length, converting them to ints, and finding the midpoint, roughly as sketched below.
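A sketch of the bisection (the helper names are illustrative, not the exact code from my script):

# Illustrative sketch of the range bisection (helper names are made up).
def pad_key(key: bytes, length: int) -> bytes:
    # Right-pad with zero bytes so both keys have the same length.
    return key + b"\x00" * (length - len(key))

def bisect_range(start: bytes, end: bytes):
    # Treat the padded keys as big-endian ints and split at the midpoint.
    length = max(len(start), len(end))
    s = int.from_bytes(pad_key(start, length), "big")
    e = int.from_bytes(pad_key(end, length), "big")
    mid_key = ((s + e) // 2).to_bytes(length, "big")
    return (start, mid_key), (mid_key, end)

def split_range(start: bytes, end: bytes, depth: int = 5):
    # Bisect `depth` times, giving 2**depth sub-ranges to feed to read_rows.
    ranges = [(start, end)]
    for _ in range(depth):
        ranges = [half for r in ranges for half in bisect_range(*r)]
    return ranges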