I have a list column in my pandas DataFrame alongside int, string, and other columns. I am able to convert the string, date, int, and timestamp columns, but I want to know how to apply array() to the list column.
import pyarrow as pa

fields = [
    pa.field('id', pa.int64()),
    pa.field('secondaryid', pa.int64()),
    pa.field('date', pa.timestamp('ms')),
    pa.field('emails', pa.array())
]
my_schema = pa.schema(fields)
table = pa.Table.from_pandas(sample_df, schema=my_schema, preserve_index=False)
It asks for an object to be passed for the array. I want to know how to apply a schema for an array of strings to the 'emails' column, bearing in mind that I will be writing the table out to Parquet format, so an empty array will result in a segfault. What is the best approach?
You need to supply pa.list_(pa.string()) instead of pa.array. pa.array is the constructor for a pyarrow.Array instance, which is the main object holding data of any type. In contrast, pa.list_() is the constructor for the LIST type. As its single argument, it takes the type of the elements that the list is composed of.
In Arrow terms, an array is the simplest structure holding typed data. It consists of a number of buffers of contiguous memory. The first buffer is always a bitmap indicating whether a row is valid or null. Depending on the type of the array, there will be a single buffer for the data (e.g. for ints) or multiple ones for more complicated types. In contrast, the term list is used to describe what kind of data is stored in an array. LIST means that a single cell/row in a column can hold multiple values of the same kind.