Tags: python, c++, boost-python, pybind11, apache-arrow

Apache Arrow Bus Error/Seg Fault when using Python bindings


I am writing data to Parquet files. Apache Arrow provides a straightforward example for doing this: parquet-arrow, in which the data flow is essentially: data => arrow::ArrayBuilder => arrow::Array => arrow::Table => parquet file. This works fine as standalone C++, but when I attempt to bind this code into a Python module and call it from Python (I'm using Python 3.8.0), a bus error 10 (or seg fault 11) occurs consistently at the arrow::ArrayBuilder => arrow::Array step (i.e. in the ArrayBuilder::Finish function). Does anyone have any idea why this could be occurring or how to correct it?

I have attempted several adjustments to work around this issue, such as static vs. dynamic library linking, different ArrayBuilder::Finish overloads, and different tools for creating the Python module/.so (both pybind11 and boost-python), but the error persists. It crashes consistently in arrow::ArrayBuilder::Finish(std::shared_ptr<arrow::Array>*). I'm running on macOS. This simple .py and .cc code is enough to re-create the error:

The Python script:

import pybindtest
pybindtest.python_bind_test()

The C++ module source:

#include <iostream>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <pybind11/pybind11.h>

std::shared_ptr<arrow::Table> generate_table() {
  arrow::Int64Builder i64builder;
  std::shared_ptr<arrow::Array> i64array;
  PARQUET_THROW_NOT_OK(i64builder.AppendValues({2, 4}));
  PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));

  arrow::StringBuilder strbuilder;
  std::shared_ptr<arrow::Array> strarray;
  PARQUET_THROW_NOT_OK(strbuilder.Append("some"));
  PARQUET_THROW_NOT_OK(strbuilder.Append("content"));
  PARQUET_THROW_NOT_OK(strbuilder.Finish(&strarray));

  std::shared_ptr<arrow::Schema> schema = arrow::schema(
      {arrow::field("int", arrow::int64()), 
       arrow::field("str", arrow::utf8())});

  return arrow::Table::Make(schema, {i64array, strarray});
}

void write_parquet_file(const arrow::Table& table) {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("pybindtest.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}

void python_bind_test() {
  std::shared_ptr<arrow::Table> table = generate_table();
  write_parquet_file(*table);
}

PYBIND11_MODULE(pybindtest, m) {
  m.def("python_bind_test", &python_bind_test);
}
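
For reference, here is a minimal sketch of how a module like this might be built with pybind11's setuptools helpers. The source filename, the C++ standard, and the Arrow/Parquet install prefix are assumptions (e.g. a Homebrew or source-build location) and should be adjusted to the actual setup:

# setup.py -- hypothetical build script for the pybindtest module above
from pybind11.setup_helpers import Pybind11Extension, build_ext
from setuptools import setup

ARROW_PREFIX = "/usr/local"  # assumed install prefix of arrow-cpp

ext_modules = [
    Pybind11Extension(
        "pybindtest",
        ["pybindtest.cc"],                        # assumed source filename
        include_dirs=[f"{ARROW_PREFIX}/include"],
        library_dirs=[f"{ARROW_PREFIX}/lib"],
        libraries=["arrow", "parquet"],           # dynamic linking against libarrow/libparquet
        extra_link_args=[f"-Wl,-rpath,{ARROW_PREFIX}/lib"],  # so the .so can find the dylibs at runtime
        cxx_std=11,
    ),
]

setup(name="pybindtest", ext_modules=ext_modules, cmdclass={"build_ext": build_ext})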

This is a backtrace of one of the cores:

$ lldb -c core.84103 
(lldb) target create --core "core.84103"
Core file '/cores/core.84103' (x86_64) was loaded.

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff91b52a58 libc++abi.dylib`vtable for __cxxabiv1::__si_class_type_info + 16
    frame #1: 0x0000000103b1f4c8 libarrow.300.0.0.dylib`arrow::ArrayBuilder::Finish(std::__1::shared_ptr<arrow::Array>*) + 40
    frame #2: 0x0000000103a0c492 pybindtest.cpython-38-darwin.so`generate_table() + 642
    frame #3: 0x0000000103a0e298 pybindtest.cpython-38-darwin.so`python_bind_test() + 24
    frame #4: 0x0000000103a4425f pybindtest.cpython-38-darwin.so`void pybind11::detail::argument_loader<>::call_impl<void, void (*&)(), pybind11::detail::void_type>(void (*&)(), pybind11::detail::index_sequence<>, pybind11::detail::void_type&&) && + 31
    frame #5: 0x0000000103a44136 pybindtest.cpython-38-darwin.so`std::__1::enable_if<std::is_void<void>::value, pybind11::detail::void_type>::type pybind11::detail::argument_loader<>::call<void, pybind11::detail::void_type, void (*&)()>(void (*&)()) && + 54
    frame #6: 0x0000000103a43ff2 pybindtest.cpython-38-darwin.so`void pybind11::cpp_function::initialize<void (*&)(), void, pybind11::name, pybind11::scope, pybind11::sibling>(void (*&)(), void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::operator()(pybind11::detail::function_call&) const + 130
    frame #7: 0x0000000103a43f55 pybindtest.cpython-38-darwin.so`void pybind11::cpp_function::initialize<void (*&)(), void, pybind11::name, pybind11::scope, pybind11::sibling>(void (*&)(), void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::__invoke(pybind11::detail::function_call&) + 21
    frame #8: 0x0000000103a2cb62 pybindtest.cpython-38-darwin.so`pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 4818
    frame #9: 0x00000001035cf164 python`cfunction_call_varargs + 68
    frame #10: 0x00000001035ce3a7 python`_PyObject_MakeTpCall + 167
    frame #11: 0x0000000103713228 python`_PyEval_EvalFrameDefault + 45944
    frame #12: 0x0000000103706060 python`_PyEval_EvalCodeWithName + 560
    frame #13: 0x0000000103780a7c python`PyRun_FileExFlags + 364
    frame #14: 0x0000000103780171 python`PyRun_SimpleFileExFlags + 529
    frame #15: 0x00000001037a8c5a python`pymain_run_file + 394
    frame #16: 0x00000001037a81b6 python`pymain_run_python + 486
    frame #17: 0x00000001037a7f88 python`Py_RunMain + 24
    frame #18: 0x00000001037a9670 python`pymain_main + 32
    frame #19: 0x00000001035a1cb9 python`main + 57
    frame #20: 0x00007fff6b8b7cc9 libdyld.dylib`start + 1
    frame #21: 0x00007fff6b8b7cc9 libdyld.dylib`start + 1
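
Frame #1 places the crash inside libarrow.300.0.0.dylib. Since more than one copy of the Arrow libraries can end up on a system (which turned out to matter here), it can be worth confirming which copy the extension links against and actually loads at runtime; a hedged sketch of one way to check on macOS, with the .so name taken from the backtrace:

$ otool -L pybindtest.cpython-38-darwin.so | grep -i arrow                        # libraries the module is linked against
$ DYLD_PRINT_LIBRARIES=1 python -c "import pybindtest" 2>&1 | grep -i libarrow    # libraries actually loaded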

Solution

  • Upon further investigation, this error appears to be triggered by a conflict between the arrow-cpp libraries I am building from source and the pyarrow package I was installing from conda-forge. I was able to address the problem simply by pip-installing pyarrow into my conda env rather than pulling it from the conda-forge channel (see the command sketch at the end of this answer); the same applies to pyspark in my case, as it depends on pyarrow.

    Though I don't know the exact reason for this incompatibility, it may relate to the current macOS caveat in the Arrow Python documentation, which states:

    Using conda to build Arrow on macOS is complicated by the fact that the conda-forge compilers require an older macOS SDK. Conda offers some installation instructions; the alternative would be to use Homebrew and pip instead.
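
    For completeness, the switch amounts to something like the following commands. The removal step is an assumption about how the conda-forge build was displaced, and exact package versions are omitted:

    $ conda remove pyarrow    # drop the conda-forge build (assumed step)
    $ pip install pyarrow     # install the pip wheel into the same conda env
    $ pip install pyspark     # pyspark depends on pyarrow, so it was reinstalled the same way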