
How to Read multiple parquet files or a directory using apache arrow in cpp


I am new to the Apache Arrow C++ API. I want to read multiple Parquet files (or a whole directory) with it, similar to reading a dataset as a table with the Python API, but I can't find any example of this. I know I can read a single Parquet file using:

   #include <arrow/api.h>
   #include <arrow/filesystem/localfs.h>
   #include <parquet/arrow/reader.h>

   arrow::Status st;
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   arrow::fs::LocalFileSystem file_system;
   std::shared_ptr<arrow::io::RandomAccessFile> input =
       file_system.OpenInputFile("/tmp/data.parquet").ValueOrDie();

   // Open Parquet file reader
   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
   st = parquet::arrow::OpenFile(input, pool, &arrow_reader);

Please let me know if you have any questions. Thanks in advance.


Solution

  • The feature is called "datasets"

    There is a fairly complete example here: https://github.com/apache/arrow/blob/apache-arrow-5.0.0/cpp/examples/arrow/dataset_parquet_scan_example.cc

    The C++ documentation for the feature is here: https://arrow.apache.org/docs/cpp/dataset.html

    I'm working on a recipe for the cookbook, but in the meantime I can post some snippets here. These come from this work-in-progress: https://github.com/westonpace/arrow-cookbook/blob/feature/basic-dataset-read/cpp/code/datasets.cc

    Essentially you will want to create a filesystem and select some files:

      // Create a filesystem
      std::shared_ptr<arrow::fs::LocalFileSystem> fs =
          std::make_shared<arrow::fs::LocalFileSystem>();
    
      // Create a file selector which describes which files are part of
      // the dataset.  This selector performs a recursive search of a base
      // directory which is typical with partitioned datasets.  You can also
      // create a dataset from a list of one or more paths.
      arrow::fs::FileSelector selector;
      selector.base_dir = directory_base;
      selector.recursive = true;
    

    Then you will want to create a dataset factory and a dataset:

      // Create a file format which describes the format of the files.
      // Here we specify we are reading parquet files.  We could pick a different format
      // such as Arrow-IPC files or CSV files or we could customize the parquet format with
      // additional reading & parsing options.
      std::shared_ptr<arrow::dataset::ParquetFileFormat> format =
          std::make_shared<arrow::dataset::ParquetFileFormat>();
    
      // Create a partitioning factory.  A partitioning factory will be used by a dataset
      // factory to infer the partitioning schema from the filenames.  All we need to specify
      // is the flavor of partitioning which, in our case, is "hive".
      //
      // Alternatively, we could manually create a partitioning scheme from a schema.  This is
      // typically not necessary for hive partitioning as inference works well.
      std::shared_ptr<arrow::dataset::PartitioningFactory> partitioning_factory =
          arrow::dataset::HivePartitioning::MakeFactory();
    
      arrow::dataset::FileSystemFactoryOptions options;
      options.partitioning = partitioning_factory;
    
      // Create a dataset factory
      ASSERT_OK_AND_ASSIGN(
          std::shared_ptr<arrow::dataset::DatasetFactory> dataset_factory,
          arrow::dataset::FileSystemDatasetFactory::Make(fs, selector, format, options));
    
      // Create the dataset, this will scan the dataset directory to find all of the files
      // and may scan some file metadata in order to determine the dataset schema.
      ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Dataset> dataset,
                           dataset_factory->Finish());
    

    Finally, you will want to "scan" the dataset to get the data:

      // Create a scanner
      arrow::dataset::ScannerBuilder scanner_builder(dataset);
      ASSERT_OK(scanner_builder.UseAsync(true));
      ASSERT_OK(scanner_builder.UseThreads(true));
      ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Scanner> scanner,
                           scanner_builder.Finish());
    
      // Scan the dataset.  There are a variety of other methods available on the scanner as
      // well
      ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::Table> table, scanner->ToTable());
      std::cout << "Read in a table with " << table->num_rows() << " rows and "
                << table->num_columns() << " columns" << std::endl;
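
    Note that the `ASSERT_OK` / `ASSERT_OK_AND_ASSIGN` macros above come from the cookbook's test harness; in an application you would use Arrow's `Status`/`Result` machinery instead. Here is a sketch of the same steps combined into one function using `ARROW_ASSIGN_OR_RAISE` and `ARROW_RETURN_NOT_OK` (the function name `ReadEntireDataset` is my own, and this assumes an Arrow 5.0-era dataset API):

    ```cpp
    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/dataset/api.h>
    #include <arrow/filesystem/api.h>

    // Read every Parquet file under `directory_base` (recursively, with
    // hive-style partitioning inferred from the directory names) into a
    // single arrow::Table.
    arrow::Result<std::shared_ptr<arrow::Table>> ReadEntireDataset(
        const std::string& directory_base) {
      auto fs = std::make_shared<arrow::fs::LocalFileSystem>();

      // Select all files under the base directory
      arrow::fs::FileSelector selector;
      selector.base_dir = directory_base;
      selector.recursive = true;

      // The files are Parquet; infer hive partitioning from the paths
      auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
      arrow::dataset::FileSystemFactoryOptions options;
      options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();

      // Discover the files and build the dataset
      ARROW_ASSIGN_OR_RAISE(auto dataset_factory,
                            arrow::dataset::FileSystemDatasetFactory::Make(
                                fs, selector, format, options));
      ARROW_ASSIGN_OR_RAISE(auto dataset, dataset_factory->Finish());

      // Scan the dataset into a table
      arrow::dataset::ScannerBuilder scanner_builder(dataset);
      ARROW_RETURN_NOT_OK(scanner_builder.UseThreads(true));
      ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder.Finish());
      return scanner->ToTable();
    }
    ```

    You could then call `ReadEntireDataset("/tmp/my_dataset")` and, on success, inspect `table->num_rows()` and `table->num_columns()` as in the snippet above. Remember to link against the Arrow dataset library (`-larrow_dataset`) in addition to `-larrow` and `-lparquet`.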