I have a scenario in which I am ingesting data from an MS SQL database into Azure Data Lake using U-SQL. My table is quite big, with over 16 million records (and soon many more). I simply run SELECT a, b, c FROM dbo.myTable;
I realized, however, that only one vertex is used to read from the table.
My question is, is there any way to leverage parallelism while reading from a SQL table?
Queries to external data sources are not automatically parallelized in U-SQL. (This is something we are considering for the future.)
wBob's answer gives one option for achieving roughly the same effect, though it requires you to manually partition the data and query it with multiple U-SQL statements.
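To make the manual partitioning concrete, here is a minimal sketch (in Python, purely for illustration) that splits a key range into contiguous, non-overlapping slices and builds one pass-through query per slice. It assumes the table has an integer key column, here hypothetically called id, with values from 1 to 16,000,000; neither the column name nor the bounds come from the question, and the helper partition_queries is made up for this example.

```python
def partition_queries(table, columns, key, lo, hi, parts):
    """Build SELECT statements that together cover key values lo..hi (inclusive).

    The slices are contiguous and non-overlapping, so each key value
    falls into exactly one query.
    """
    if parts < 1 or hi < lo:
        raise ValueError("need parts >= 1 and hi >= lo")
    total = hi - lo + 1
    parts = min(parts, total)           # never create empty slices
    size, extra = divmod(total, parts)  # the first `extra` slices get one more key
    column_list = ", ".join(columns)
    queries = []
    start = lo
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)  # exclusive upper bound
        queries.append(
            f"SELECT {column_list} FROM {table} "
            f"WHERE {key} >= {start} AND {key} < {end}"
        )
        start = end
    return queries


# Hypothetical key column `id` and hypothetical bounds 1..16,000,000.
queries = partition_queries("dbo.myTable", ["a", "b", "c"], "id", 1, 16_000_000, 4)
for query in queries:
    print(query)
```

Each generated query would then go into its own U-SQL statement against the external data source, with the resulting rowsets combined afterwards. Half-open ranges (>= start, < end) are used so that adjacent slices cannot both pick up the same boundary row.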
Please note that doing parallel reads in a non-transacted environment can lead to duplicate or missed data if writes occur at the source while the reads are in progress, since the separate queries do not share a single consistent view of the table. So some care needs to be taken, and users need to understand the trade-offs.