Search code examples
amazon-web-servicesconcurrencyparallel-processingamazon-redshift

How to enable Redshift concurrent queries


I have been trying to run multiple queries simultaneously on my Redshift cluster (ra3.16xlarge). To my knowledge there should be capacity to run up to 50 queries at once but I only see two at a time, one on each node. To remedy this I adjusted the WLM as detailed below to no avail.

  • created a new redshift parameter group with "concurrency_scaling": "auto"
  • set max_concurrency_scaling_clusters = 10
  • changed the redshift parameter group
  • restarted the cluster 2x times to reflect changes in console
  • ran queries in python with concurrency
  • console only shows one query being executed at a time
  • tested kicking off queries from Python and Query Editor V2

I am still unable to run additional queries and the concurrency clusters do not spin up. Am I missing a setting or potentially a query label? Ideally I would execute these


Solution

  • A few misconceptions here. Please take a look at https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html

    I’ll explain some here.

    First concurrent queries and concurrent clusters are two different but related topics. Concurrent queries is the ability of a Redshift cluster to be executing 2 or more queries at the same time. However, each connection (session) can only execute one query at a time. So multiple connections are needed to get this to happen. I suspect you are doing this but want to get this info out.

    Next all nodes in a cluster execute a query together. It is parallel execution of the query and even when multiple queries are in their execution phase, only one query is making progress on the CPUs (all the computer node CPUs) at once. This is why execution skew is a big deal. You may have misspoken on this when saying “one on each node” and may have meant one on each cluster but it isn’t clear.

    Redshift will spin up a “concurrency scaling cluster” several conditions are met. First the main cluster must be heavily loaded so that there is a benefit for running queries on the remote cluster. Second the queries that can be moved to a remote or scaling cluster MUST BE READ-ONLY. You cannot move INSERTs to a remote cluster. (This is due to coherency concerns.). Also the queries that will move Redshift needs to determine will execute for some time beyond table scanning. Since the data “lives” on the primary cluster (from a coherency point of view at least), data access adds load to the primary cluster. If a query is primarily just scanning then there is little to no benefit to moving it to a remote cluster. All of this needs to be determined before the query is executed (in the compile phase) so up to date metadata is key.

    The ML process that makes the determination of query execution on a scaling (remote) cluster is not adjustable. It is just too complex to expose meaningful knobs