Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory database, but I was not convinced. The example there discusses a scan of a 10,000,000 records table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type take forever instead of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
Indexing Distributed Data Structures:
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today) take different approach and use advanced compression techniques with with random access support.
Of course this raises a question - if you require an efficient random access why not use a system which is design as a database from the beginning. There many choices out there, including at least a few maintained by the Apache Foundation. At the same time Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.