Search code examples
azure-stream-analytics

Can we do query parallelization/ partitioning for a Azure Stream Analytics job which has multiple queries to multiple output types?


Understand from this MS documentation, that Table Storage and Cosmos DB do support partitioning with ASA job.

However, does it work with, for example:

  1. Single ASA job
    • Input: Event hub with 8 partitions
    • Outputs: Table Storage with 4 partitions and Cosmos DB with 4 partitions

Would the above work, as in able to harness the benefit of query parallelization/ partitioning?

Thanks in advance.


Solution

  • Yes, the scenario you have here is supported, but it's a bit more subtle than expected.

    The key here is that both Table Storage and Cosmos DB are output that don't have set partition counts. With a job like yours, from a logical standpoint, ASA will actually run 8 partition writers for each output. It just happens that 4 of them will be running empty. Or rather, instead of 8 writing to 2 outputs each, you will have 8 with 4 writing to one output, and 4 writing to the other.

    Note that if you were to target EH as an output, instead of Table or Cosmos, you would need that EH to have 8 partitions, even if you were only sending data to 4, to keep full parallelization.

    In terms of query, you will need to write something like:

    WITH inputStream AS (
        SELECT * 
        FROM [input]
    )
    
    SELECT * 
    INTO [table]
    FROM inputStream
    WHERE PartitionKey IN (1,2,3,4)
    
    SELECT * 
    INTO [cosmos]
    FROM inputSTream
    WHERE PartitionKey IN (5,6,7,8)
    

    Or any kind of query that doesn't break the partition alignment.