OK, so I have Auto Loader working in directory listing mode, because the event-driven (file notification) mode requires more elevated permissions than we can get in LIVE.
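For reference, directory listing is Auto Loader's default; the mode is toggled with the `cloudFiles.useNotifications` option. A minimal sketch of my read side (the `abfss://` paths and `<account>` are placeholders, not my real containers):

```python
# Sketch only: an Auto Loader stream in directory listing mode (the default).
# Paths and "<account>" are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # Stay on directory listing explicitly; "true" would switch to the
    # event-driven file notification mode that needs the extra permissions.
    .option("cloudFiles.useNotifications", "false")
    .option("cloudFiles.schemaLocation",
            "abfss://raw@<account>.dfs.core.windows.net/_schemas/mytable")
    .load("abfss://landing@<account>.dfs.core.windows.net/mytable/")
)
```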
So, basically, what the Auto Loader job does is: it iteratively reads Parquet files from many different folders in the landing zone (many small files), writes them into a raw container as Delta Lake with schema inference and evolution, creates external tables, and runs an OPTIMIZE.
That's about it.
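The write side looks roughly like this (again a sketch with placeholder names): write Delta with schema evolution enabled, register an external table over the location, then compact with OPTIMIZE.

```python
# Sketch: write the stream into the raw container as Delta, letting the
# schema evolve as new columns show up in the landing files.
(
    df.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://raw@<account>.dfs.core.windows.net/_checkpoints/mytable")
    .option("mergeSchema", "true")   # allow schema evolution on write
    .trigger(availableNow=True)      # drain what's there, then stop
    .start("abfss://raw@<account>.dfs.core.windows.net/mytable/")
    .awaitTermination()
)

# Register an external table over the Delta location, then compact the
# many small ingested files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw.mytable
    USING DELTA
    LOCATION 'abfss://raw@<account>.dfs.core.windows.net/mytable/'
""")
spark.sql("OPTIMIZE raw.mytable")
```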
My question is: for this workload, what would be the ideal node type (worker and driver) for my cluster in Azure? Meaning, should it be "Compute Optimized", "Storage Optimized", or "Memory Optimized"?
From this link, it looks like "Compute Optimized" would probably be the best choice, but my job spends most of its time reading landing files (many small files) and writing Delta files, checkpoints, and schemas, so shouldn't "Storage Optimized" be the better fit here?
I plan to try all of them out, but if someone already has pointers, that would be appreciated.
By the way, the storage here is Azure Data Lake Storage Gen2.
If you don't do too many complex aggregations, I would recommend "Compute Optimized" or "General Purpose" nodes for this work. The primary load will be reading the data from files, combining it, and writing it back to ADLS, so the more CPU power you have, the faster the processing will be.
Only if you have a very large number of small files (think tens or hundreds of thousands) should you consider a bigger driver node, since the driver is what identifies the new files in storage.
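One knob that helps in that scenario, sketched here with placeholder paths: `cloudFiles.maxFilesPerTrigger` bounds how many files each micro-batch processes, which keeps the driver's per-batch listing and planning work under control.

```python
# Sketch: cap the number of files pulled into each micro-batch so the
# driver isn't planning tens of thousands of files at once.
# (1000 is Auto Loader's default; 5000 here is just an example value.)
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.maxFilesPerTrigger", "5000")
    .option("cloudFiles.schemaLocation",
            "abfss://raw@<account>.dfs.core.windows.net/_schemas/mytable")
    .load("abfss://landing@<account>.dfs.core.windows.net/mytable/")
)
```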