Now that QuestDB can generate tables directly in Parquet format, I am trying to explore the data using pyarrow with this code:
import pyarrow.dataset as ds
# Define the base directory
base_dir = "/opt/homebrew/var/questdb/db/chicago_parquet_water_sensors~2414"
dataset = ds.dataset(
    base_dir,
    format="parquet",
    partitioning="hive"
)
df = dataset.to_table()
I get this error:
Exception during reset or similar
Traceback (most recent call last):
File "/Users/j/prj/python/questdb_dataset_utils/venv/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3505, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/var/folders/_5/q677vj0j0nn7hz56s8_gsgpc0000gn/T/ipykernel_45561/2530093135.py", line 14, in <module>
df = dataset.to_table()
File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '/opt/homebrew/var/questdb/db/chicago_parquet_water_sensors~2414/2023-09.46/MeasurementID.d': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
When I check the table folder, I can see most partitions contain a single Parquet file, but one partition has no Parquet files, and that is the one causing the error. If I point the dataset at an individual partition everything works fine, but I want to access the whole table.
If I try the same code pointing to a Parquet dataset generated elsewhere I don't have any issues, so this must be a QuestDB thing rather than a pyarrow one.
QuestDB stores all partitions except the active one in Parquet format. The active partition is kept in QuestDB's own native binary format for efficiency, since it is likely to receive updates and out-of-order data, which are less efficient to process as Parquet.
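If you want to confirm this, a quick check like the one below (a minimal sketch, reusing the base_dir from the question) should show that only the newest partition directory has no .parquet file and instead holds QuestDB's native column files, such as the MeasurementID.d mentioned in the error:
import os
base_dir = "/opt/homebrew/var/questdb/db/chicago_parquet_water_sensors~2414"
# Flag which partition directories contain Parquet files and which only hold
# QuestDB's native column files (the active partition)
for entry in sorted(os.listdir(base_dir)):
    partition_dir = os.path.join(base_dir, entry)
    if not os.path.isdir(partition_dir):
        continue
    has_parquet = any(name.endswith(".parquet") for name in os.listdir(partition_dir))
    print(entry, "parquet" if has_parquet else "native format (likely the active partition)")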
The easiest workaround is telling Arrow to skip the invalid (non-Parquet) files, as in:
import pyarrow.dataset as ds
# Define the base directory
base_dir = "/opt/homebrew/var/questdb/db/chicago_parquet_water_sensors~2414"
dataset = ds.dataset(
    base_dir,
    format="parquet",
    partitioning="hive",
    exclude_invalid_files=True
)
df = dataset.to_table()
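Keep in mind that exclude_invalid_files makes Arrow open each file during dataset discovery to decide whether it is valid Parquet, so it adds some overhead on tables with many partitions. The result of to_table() is a regular Arrow table, so converting it for exploration is straightforward:
# The Arrow table converts to pandas directly if that's more convenient
pdf = df.to_pandas()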
When using other tools, like DuckDB, which accept wildcards, we can just specify the parquet extension explicitly:
select * from read_parquet('/opt/homebrew/var/questdb/db/chicago_parquet_water_sensors~2414/*/*.parquet', hive_partitioning = true)
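If you prefer to stay in Python, the same query can be run through the duckdb package (a small sketch, assuming duckdb is installed in the environment); DuckDB expands the wildcard itself, so the active non-Parquet partition is simply never matched:
import duckdb
query = """
select * from read_parquet(
    '/opt/homebrew/var/questdb/db/chicago_parquet_water_sensors~2414/*/*.parquet',
    hive_partitioning = true
)
"""
# Run the query and pull the result into a pandas DataFrame
df = duckdb.sql(query).df()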