Tags: c++, parquet, python-polars, apache-arrow

Reading a column of strings from a parquet file into C++ string variables


I have a Parquet file that I created with the Python polars package. It has a single column of variable-length strings that looks like:

┌──────────┐
│ str_list │
│ ---      │
│ str      │
╞══════════╡
│ ALV5     │
│ SMGWX    │
│ NEGOT    │
│ S2U0S    │
│ …        │
│ KFO      │
│ LJ3J     │
│ PCY6O    │
│ GQ0W7    │
└──────────┘

I am trying to read this file in C++ into string variables, but I am not sure what to cast the column to, since its type turns out to be LARGE_STRING:

assert(record_batch->column(0)->type_id() == arrow::Type::LARGE_STRING)

is true.

I can do

auto strlist = std::static_pointer_cast<arrow::LargeStringArray>(record_batch->column(0));

but I cannot find any member function of strlist that lets me copy the strings to my own string variable.


Solution

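  • Use LargeStringArray::GetView(i). It returns a std::string_view that points into the array's own buffer, so you don't have to heap-allocate a std::string for every string in the array. If you do need an owning copy, construct a std::string from that view or call GetString(i), which returns a std::string directly.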
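
For the case where you want your own copies, here is a minimal sketch of the loop. It is written as a template over any type that offers length(), IsNull(i) and GetView(i), which is the subset of the arrow::LargeStringArray interface used here. FakeStringArray and CopyStrings are hypothetical names made up for this illustration; the stand-in struct only exists so the snippet compiles without Arrow installed. Turning null entries into empty strings is an arbitrary choice for the sketch, not something Arrow prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in exposing the same three accessors used below
// (length, IsNull, GetView) so this sketch builds without Arrow.
// With Arrow, pass *strlist (an arrow::LargeStringArray) instead.
struct FakeStringArray {
    std::vector<std::string> values;
    std::vector<bool> nulls;

    std::int64_t length() const {
        return static_cast<std::int64_t>(values.size());
    }
    bool IsNull(std::int64_t i) const {
        return nulls[static_cast<std::size_t>(i)];
    }
    std::string_view GetView(std::int64_t i) const {
        return values[static_cast<std::size_t>(i)];
    }
};

// Copies every element of the array into an owning std::string.
// Null slots become empty strings (an assumption made for this sketch).
template <typename StringArray>
std::vector<std::string> CopyStrings(const StringArray& array) {
    std::vector<std::string> out;
    out.reserve(static_cast<std::size_t>(array.length()));
    for (std::int64_t i = 0; i < array.length(); ++i) {
        if (array.IsNull(i)) {
            out.emplace_back();
        } else {
            // GetView gives a non-owning view into the array's buffer;
            // building a std::string from it is what performs the copy.
            out.emplace_back(array.GetView(i));
        }
    }
    return out;
}
```

With the strlist pointer from the question, the call would be CopyStrings(*strlist). If you only need to read the strings, skip the copy and use the views directly, but keep the array (or its record batch) alive for as long as you hold them, since a std::string_view does not own its data. C++17 or later is assumed for std::string_view.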