azure-data-lake u-sql

Custom JSON outputter using only one vertex for execution


We need to distribute our data as JSON, so we wrote a custom outputter. We also output the same data as CSV for another vendor. On investigation, I found that the JSON outputter uses a single vertex, whereas the CSV outputter uses 5 vertices for the same data, and the JSON output also takes much longer. What is the reason for this behavior, and is there a way to change it?


Solution

  • The reason you get a single vertex for JSON but 5 vertices for CSV is simple.

    JSON is a hierarchical data format, so the outputter needs the whole rowset in a single vertex to know what the structure will be. Even if the outputter only writes a JSON array of objects, one per row, the opening and closing brackets of the array are a form of nesting: the outputter has to know which row is the first and which is the last.

    If you used the sample outputter from the Microsoft U-SQL GitHub page, that outputter was implemented with AtomicFileProcessing turned on (AtomicFileProcessing = true on the SqlUserDefinedOutputter attribute) for this reason.

    CSV is a flat, row-by-row format. Thus you can partition the rowset into subsets and serialize them individually. There is no structure impeding parallelization.

    So unless you decide to output one JSON document per row (which means the combined output is no longer a single valid JSON document), you cannot parallelize the hierarchical output.
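
The difference between the formats can be sketched outside of U-SQL. The following Python snippet is only an illustration, not outputter code: the sample rows and the helpers `partition`, `to_csv`, `to_json_array` and `to_json_lines` are hypothetical names made up for this example, and each chunk stands in for the slice of the rowset that one vertex would serialize on its own.

```python
import csv
import io
import json

# Hypothetical sample rowset; each dict stands in for one row.
rows = [{"id": i, "name": f"item{i}"} for i in range(1, 7)]


def partition(rows, n):
    """Split rows into up to n contiguous chunks, one per simulated vertex."""
    size = -(-len(rows) // n)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def to_csv(chunk):
    """Serialize a chunk as CSV lines; no state is shared between chunks."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for r in chunk:
        writer.writerow([r["id"], r["name"]])
    return buf.getvalue()


def to_json_array(chunk):
    """Serialize a chunk as a complete JSON array."""
    return json.dumps(chunk)


def to_json_lines(chunk):
    """Serialize a chunk as one JSON document per row."""
    return "".join(json.dumps(r) + "\n" for r in chunk)


chunks = partition(rows, 3)

# CSV: concatenating independently written chunks equals the single-pass output.
csv_joined = "".join(to_csv(c) for c in chunks)
csv_matches = csv_joined == to_csv(rows)

# JSON array: each chunk writes its own brackets, so the concatenation
# "[...][...][...]" is not one valid JSON document.
array_joined = "".join(to_json_array(c) for c in chunks)
try:
    json.loads(array_joined)
    array_is_valid = True
except json.JSONDecodeError:
    array_is_valid = False

# One document per row: chunks concatenate cleanly and every line parses,
# but the file as a whole is not a single JSON document.
lines_joined = "".join(to_json_lines(c) for c in chunks)
parsed = [json.loads(line) for line in lines_joined.splitlines()]

print(csv_matches, array_is_valid, parsed == rows)
```

In this sketch the CSV and one-document-per-row outputs survive being written in independent pieces, while the JSON array does not, which is the same trade-off described above: either a single writer sees the whole rowset, or the consumer has to accept line-delimited JSON instead of one document.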