I've been reading a bit about the Parquet format and how Spark integrates with it.
Being a columnar format, Parquet really shines whenever Spark can cooperate with the underlying storage: it can perform projections without loading all the data, and it can instruct the storage to load only specific column chunks based on various statistics (when a filter is involved).
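For reference, here's roughly what I mean; the path and column names are made up, I just want to illustrate the pruned/pushed scan I'm hoping for:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: container, account and column names are invented.
val spark = SparkSession.builder().appName("pushdown-check").getOrCreate()

val sessions = spark.read.parquet(
  "wasbs://container@account.blob.core.windows.net/sessions/")

// Only two columns are requested and a filter is applied; with a cooperative
// store the physical plan should show column pruning and PushedFilters
// in the Parquet scan node.
sessions
  .select("sessionId", "durationMs")
  .where("durationMs > 1000")
  .explain()
```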
I saw a lecture on YouTube (at 21:54) cautioning that object stores do not support pushdown filters (Amazon S3 was given as the example).
How does Azure Blob Storage fare in this regard (when we read the session Parquet files)?
She's wrong. More specifically, even at the time of that Feb 2017 talk she was wrong about S3: Hadoop 2.8 added the fix (HADOOP-13203), and it has been backported to CDH and HDP for ages.
Azure has had it since Aug 2017 (HADOOP-14535), which is backported to shipping Azure HDInsight and HDP (check with Cloudera about CDH).
The problem she's alluding to is that seek() is expensive over an HTTP connection: if there are many GB of data still to download, you need to abort the connection and set up a new one. The Hadoop patches above change the IO mode for these stores to optimise for random access, issuing GETs with a limited content length so the same HTTP/1.1 connection can be reused. This is pathological for full file reads, so S3A makes you ask for it explicitly (fs.s3a.experimental.fadvise=random), while Azure switches to random IO on the first backwards seek.
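If you want to try that from Spark, something like the following should work, assuming a Hadoop/S3A build that understands the property quoted above; the bucket and column names here are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch: enable S3A's random-IO mode before reading columnar data.
// Spark copies spark.hadoop.* settings into the Hadoop Configuration used
// by the S3A connector.
val spark = SparkSession.builder()
  .appName("parquet-s3a-random-io")
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// Predicate pushdown and column pruning still happen in ParquetFileFormat;
// the fadvise setting only changes how the connector issues its HTTP GETs.
spark.read.parquet("s3a://my-bucket/sessions/")
  .filter(col("eventDate") === "2017-08-01")
  .select("sessionId")
  .show()
```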
It's nothing to do with predicate pushdown at all: that's all done in the ParquetFileFormat. It's just that seeking, especially backwards seeking, is very expensive if you need to set up new HTTP connections, and as the ORC and Parquet formats put the column summaries after the column blocks, there's a lot of that. For more details, have a look at this other talk from the same conference.
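To see where all that backwards seeking comes from, here's a rough sketch of the access pattern using plain Hadoop FileSystem calls (the path is made up, and a real reader goes through ParquetFileFormat; this only shows the seeks):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Rough illustration of the access pattern a Parquet reader causes: open the
// file, jump to the tail to read the footer metadata, then seek backwards
// into the individual column chunks it needs.
val conf = new Configuration()
val path = new Path(
  "wasbs://container@account.blob.core.windows.net/sessions/part-00000.parquet")
val fs = FileSystem.get(path.toUri, conf)
val len = fs.getFileStatus(path).getLen
val in = fs.open(path)

// The footer lives at the end of the file: the last 8 bytes are the 4-byte
// footer length followed by the "PAR1" magic number.
in.seek(len - 8)
val tail = new Array[Byte](8)
in.readFully(tail)

// A real reader now seeks backwards to the footer itself, then again to each
// column chunk it needs -- cheap on HDFS, costly over naive HTTP connections.
in.close()
```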