I have seen that the Parquet format uses dictionaries to store some columns, and that these dictionaries can be used to speed up filters if useDictionaryFilter() is used on the ParquetReader.
Is there any way to access these dictionaries from Java code?
I'd like to use them to create a list of the distinct values of my column, and thought that it would be faster to read only the dictionary values than to scan the whole column.
I have looked into the org.apache.parquet.hadoop.ParquetReader API but did not find anything.
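The methods in org.apache.parquet.column.Dictionary allow you to query the range of valid dictionary indexes (0 through getMaxId()) and to look up the entry for any index (for example, decodeToInt() for an int column).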
Once you have a Dictionary, you can iterate over all indexes to get all entries, so the question boils down to getting a Dictionary. To do that, use ColumnReaderImpl as a guide:
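import java.io.IOException;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Dictionary;
import org.apache.parquet.column.page.DictionaryPage;
import org.apache.parquet.column.page.PageReader;

// Returns the dictionary of the column chunk, or null if it has no dictionary page.
static Dictionary getDictionary(ColumnDescriptor path, PageReader pageReader) throws IOException {
  DictionaryPage dictionaryPage = pageReader.readDictionaryPage();
  if (dictionaryPage == null) return null;
  return dictionaryPage.getEncoding().initDictionary(path, dictionaryPage);
}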
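Once a dictionary is in hand, collecting its entries is a loop over the indexes 0 through getMaxId(), inclusive. The following is only a minimal sketch of that loop. To stay self-contained it uses a hypothetical IntDictionary interface as a stand-in for the int-decoding part of org.apache.parquet.column.Dictionary (only getMaxId() and decodeToInt(int) are mirrored), and the class and method names are illustrative assumptions, not part of Parquet.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the two Dictionary methods used below.
interface IntDictionary {
    int getMaxId();            // largest valid dictionary index
    int decodeToInt(int id);   // entry stored at the given index
}

class DictionaryEntries {
    // Collects every entry by walking the indexes 0..getMaxId(), inclusive.
    static Set<Integer> distinctValues(IntDictionary dictionary) {
        Set<Integer> values = new LinkedHashSet<>();
        for (int id = 0; id <= dictionary.getMaxId(); id++) {
            values.add(dictionary.decodeToInt(id));
        }
        return values;
    }

    public static void main(String[] args) {
        final int[] entries = {42, 7, 19};
        IntDictionary dictionary = new IntDictionary() {
            public int getMaxId() { return entries.length - 1; }
            public int decodeToInt(int id) { return entries[id]; }
        };
        System.out.println(distinctValues(dictionary)); // prints [42, 7, 19]
    }
}
```

With the real Dictionary class the loop would look the same, using the decodeToXxx method that matches the column's primitive type.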
Please note that a column chunk may contain a mixture of dictionary-encoded and plain data pages. If the dictionary "gets full" (reaches the maximum allowed size), the writer outputs the dictionary page and the dictionary-encoded data pages, then stops using dictionary encoding for the remaining data pages. In that case the dictionary alone is not guaranteed to contain every distinct value of the column chunk.
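To see why this matters for building a list of distinct values, here is a small, self-contained simulation of the fallback. It is a simplified model and not the Parquet writer: the Chunk class, the write and distinct methods, and the limit expressed as a number of entries (the real limit is a size in bytes) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DictionaryFallbackDemo {
    // Simplified model of a column chunk: the dictionary entries plus the
    // values written plain after the writer stopped dictionary encoding.
    static final class Chunk {
        final List<String> dictionary = new ArrayList<>();
        final List<String> plainValues = new ArrayList<>();
    }

    // Dictionary-encodes values until a new entry would exceed maxEntries,
    // then writes that value and all remaining values plain.
    static Chunk write(List<String> values, int maxEntries) {
        Chunk chunk = new Chunk();
        boolean fallback = false;
        for (String value : values) {
            if (!fallback && !chunk.dictionary.contains(value)) {
                if (chunk.dictionary.size() < maxEntries) {
                    chunk.dictionary.add(value);
                } else {
                    fallback = true; // dictionary is full
                }
            }
            if (fallback) {
                chunk.plainValues.add(value);
            }
        }
        return chunk;
    }

    // Distinct values need the dictionary AND any plain-encoded values.
    static Set<String> distinct(Chunk chunk) {
        Set<String> result = new LinkedHashSet<>(chunk.dictionary);
        result.addAll(chunk.plainValues);
        return result;
    }

    public static void main(String[] args) {
        Chunk chunk = write(Arrays.asList("a", "b", "a", "c", "d", "b"), 2);
        System.out.println(chunk.dictionary);  // prints [a, b]
        System.out.println(distinct(chunk));   // prints [a, b, c, d]
    }
}
```

In this model the dictionary holds only a and b, so reading it alone would miss c and d. A robust approach would use the dictionary only when every data page of the chunk is dictionary-encoded, and scan the column otherwise.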