How can I read the Parquet dictionary in Java?


I have seen that the Parquet format uses dictionaries to store some columns, and that these dictionaries can be used to speed up filters if useDictionaryFilter() is used on the ParquetReader.

Is there any way to access these dictionaries from Java code?
I'd like to use them to create a list of the distinct members of my column, and thought that it would be faster to read only the dictionary values than to scan the whole column.

I have looked into the org.apache.parquet.hadoop.ParquetReader API but did not find anything.


Solution

  • The methods in org.apache.parquet.column.Dictionary allow you to:

    • Query the range of dictionary indexes: Between 0 and getMaxId().
    • Look up the entry corresponding to any index, for example for an int field you can use decodeToInt().

    Once you have a Dictionary, you can iterate over all indexes to get all entries, so the question boils down to getting a Dictionary. To do that, use ColumnReaderImpl as a guide:

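    // Imports: java.io.IOException; org.apache.parquet.column.{ColumnDescriptor, Dictionary};
    // org.apache.parquet.column.page.{DictionaryPage, PageReader}
    static Dictionary getDictionary(ColumnDescriptor path, PageReader pageReader) throws IOException {
      DictionaryPage dictionaryPage = pageReader.readDictionaryPage();
      if (dictionaryPage == null) {
        return null; // this column chunk has no dictionary page
      }
      // Decode the raw page into a Dictionary for this column's type.
      return dictionaryPage.getEncoding().initDictionary(path, dictionaryPage);
    }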
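
    To make "iterate over all indexes" concrete, here is a minimal sketch for an int column. It assumes a hypothetical IntDictionary interface that only mirrors getMaxId() and decodeToInt(), so that it compiles without Parquet on the classpath; with the real library you would call the same two methods on the org.apache.parquet.column.Dictionary you obtained. The class name and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DictionaryEntries {
    // Hypothetical stand-in for the two org.apache.parquet.column.Dictionary
    // methods discussed above; it is not part of the Parquet API.
    interface IntDictionary {
        int getMaxId();
        int decodeToInt(int id);
    }

    // Collects every entry by visiting indexes 0..getMaxId(), both inclusive.
    static List<Integer> entries(IntDictionary dictionary) {
        List<Integer> values = new ArrayList<>();
        for (int id = 0; id <= dictionary.getMaxId(); id++) {
            values.add(dictionary.decodeToInt(id));
        }
        return values;
    }

    public static void main(String[] args) {
        final int[] backing = {7, 42, 1000}; // made-up dictionary contents
        IntDictionary demo = new IntDictionary() {
            public int getMaxId() { return backing.length - 1; }
            public int decodeToInt(int id) { return backing[id]; }
        };
        System.out.println(entries(demo)); // prints [7, 42, 1000]
    }
}
```

    Note that the loop bound is inclusive: getMaxId() is the largest valid index, not the number of entries.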
    

    Please note that a column chunk may contain a mixture of data pages, some dictionary-encoded and some not. If the dictionary "gets full" (reaches the maximum allowed size), the writer outputs the dictionary page and the dictionary-encoded data pages, then switches to not using dictionary encoding for the rest of the data pages. In that case the dictionary alone is not guaranteed to contain every distinct value of the column chunk.
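
    What this means for building a list of distinct values can be illustrated with a toy model. The sketch below does not use the Parquet API: the Page class, the int-array dictionary, and the sample data are all hypothetical, chosen only to show that values from non-dictionary-encoded pages have to be scanned in addition to the dictionary entries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DistinctWithFallback {
    // Hypothetical model of one data page: either dictionary-encoded
    // (payload holds dictionary indexes) or plain (payload holds the values).
    static final class Page {
        final boolean dictionaryEncoded;
        final int[] payload;

        Page(boolean dictionaryEncoded, int... payload) {
            this.dictionaryEncoded = dictionaryEncoded;
            this.payload = payload;
        }
    }

    // Distinct values of a chunk: all dictionary entries, plus whatever
    // appears in pages written after the writer stopped using the dictionary.
    static Set<Integer> distinct(int[] dictionary, List<Page> pages) {
        Set<Integer> result = new TreeSet<>();
        for (int entry : dictionary) {
            result.add(entry); // cheap part: no data page is read
        }
        for (Page page : pages) {
            if (!page.dictionaryEncoded) {
                for (int value : page.payload) {
                    result.add(value); // fallback: plain pages must be scanned
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] dictionary = {10, 20};
        List<Page> pages = Arrays.asList(
            new Page(true, 0, 1, 0),  // indexes into the dictionary
            new Page(false, 30, 10)); // plain values written after the switch
        System.out.println(distinct(dictionary, pages)); // prints [10, 20, 30]
    }
}
```

    Reading only the dictionary here would yield 10 and 20 and miss 30, so the shortcut is only complete when every data page in the chunk is dictionary-encoded.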