apache-arrow, apache-arrow-cpp

What is the difference between StringType and LargeStringType in Apache Arrow?


According to the documentation:

class arrow::StringType : public arrow::BinaryType
#include <arrow/type.h>
Concrete type class for variable-size string data, utf8-encoded.
class arrow::LargeStringType : public arrow::LargeBinaryType
#include <arrow/type.h>
Concrete type class for large variable-size string data, utf8-encoded.

How large is considered to be "large"?

What are the differences between the two data types? Why do we need two instead of one?


Solution

  • String uses signed 32-bit integers for its offsets, so the largest offset it can store is 2^31 - 1. That means a single string cannot be 2 GiB or longer, and a single array cannot hold 2 GiB or more of string data in total. LargeString uses signed 64-bit integers for its offsets, so it can hold much longer strings and much larger arrays, at the cost of 8 bytes per offset instead of 4.
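
The limit follows directly from the width of the offset type. The sketch below is a plain C++ model of an offsets buffer, not Arrow's own code: `BuildOffsets` is a hypothetical helper, and it takes string lengths rather than real strings so that nothing large has to be allocated. It assumes the usual layout of n + 1 offsets for n values, where value i spans the bytes from `offsets[i]` to `offsets[i + 1]`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical model of the offsets buffer of a variable-size string array.
// OffsetT stands in for the 32-bit offsets of String or the 64-bit offsets
// of LargeString. Returns std::nullopt when the running total of bytes no
// longer fits in OffsetT.
template <typename OffsetT>
std::optional<std::vector<OffsetT>> BuildOffsets(
    const std::vector<std::uint64_t>& lengths) {
  constexpr std::uint64_t kMax =
      static_cast<std::uint64_t>(std::numeric_limits<OffsetT>::max());

  std::vector<OffsetT> offsets;
  offsets.reserve(lengths.size() + 1);
  offsets.push_back(0);  // the first value always starts at byte 0

  std::uint64_t total = 0;
  for (std::uint64_t len : lengths) {
    assert(total <= kMax);  // invariant: everything stored so far fits
    if (len > kMax - total) {
      return std::nullopt;  // the next end offset would overflow OffsetT
    }
    total += len;
    offsets.push_back(static_cast<OffsetT>(total));
  }
  return offsets;
}
```

Under these assumptions, `BuildOffsets<std::int32_t>` fails for two strings of 1 GiB each, because their combined size of 2^31 bytes is one more than the largest 32-bit offset, while `BuildOffsets<std::int64_t>` accepts the same input.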