Tags: azure-data-lake, u-sql, data-lake

Does the ROWCOUNT hint work for EXTRACT in U-SQL?


I want to allocate more vertices to the extraction job. I tried using the ROWCOUNT hint, but it doesn't seem to work: no matter what value I use for ROWCOUNT, U-SQL always allocates the same number of vertices.

    EXTRACT xxxx FROM @"Path" USING new RndsInDataLakeCode.PyramidExtractorMerged() OPTION(ROWCOUNT=50000000);

Is there any other way to influence vertex allocation?

Thanks.


Solution

  • Basically, the number of vertices used by EXTRACT is determined by the following:

    1. The number of files (currently at most one file per vertex), if you use file sets or the extractor requests AtomicFileProcessing=true (e.g., the JSON and current Avro extractors).
    2. The size of the file (currently 1 GB per vertex), if the file is considered splittable (AtomicFileProcessing=false, e.g., the Csv/Tsv extractors).

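    The ROWCOUNT hint only tells the optimizer how many rows to expect in the result, which affects how the subsequent stages are partitioned. It does not change the number of vertices used by the EXTRACT itself.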

    The Analytics Units allocation mentioned by Omid then gives you the actual degree of parallelism, that is, how many of the determined vertices can run at the same time (so specifying more Analytics Units than there are vertices will NOT make your code parallelize more).

    Why do you want to increase the scale-out on the extraction?
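
As a rough illustration of the two sizing rules in the answer, the sketch below estimates how many vertices an EXTRACT would get. It is only a mental model written in Python, not U-SQL and not part of any Azure SDK: the function name `estimate_extract_vertices`, its parameters, and the choice of 2**30 bytes for "1 GB" are all assumptions made for this example, and the limits are described in the answer as "currently", so they may differ in practice.

```python
GB = 1024 ** 3  # assumption: "1 GB" is taken as 2**30 bytes in this sketch


def estimate_extract_vertices(file_sizes_bytes, atomic_or_file_set, split_size_bytes=GB):
    """Rough vertex estimate for EXTRACT, based on the two rules in the answer.

    file_sizes_bytes   -- sizes of the input files, in bytes (hypothetical input)
    atomic_or_file_set -- True for file sets or AtomicFileProcessing=true extractors
    split_size_bytes   -- data handled per vertex for splittable files
    """
    if atomic_or_file_set:
        # Rule 1: each vertex reads at most one file, so estimate one vertex per file.
        return len(file_sizes_bytes)
    # Rule 2: a splittable file gets roughly one vertex per 1 GB of data
    # (integer ceiling division, with a minimum of one vertex per file).
    return sum(
        max(1, (size + split_size_bytes - 1) // split_size_bytes)
        for size in file_sizes_bytes
    )


# Three hypothetical input files: 5 GB, 512 MB, and just over 3 GB.
files = [5 * GB, 512 * 1024 ** 2, 3 * GB + 1]

print(estimate_extract_vertices(files, atomic_or_file_set=True))   # 3
print(estimate_extract_vertices(files, atomic_or_file_set=False))  # 10
```

Note that the row count does not appear anywhere in this model, which matches the answer: under these rules the vertex count for the extraction changes only with the number of files or their size, not with the ROWCOUNT hint.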