
PyArrow: How to specify a partial schema


I am creating a table with some known columns and some dynamic columns. I would like to specify the data types for the known columns and infer the data types for the unknown columns. Is there a way to do this?

If I create a schema with only the known columns, then the other columns are ignored when creating the table:

import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
pydict = {'n_legs': n_legs, 'animals': animals}
partialSchema = pa.schema([('n_legs', pa.int32())])
pa.Table.from_pydict(pydict, schema=partialSchema)

pyarrow.Table
n_legs: int32
----
n_legs: [[2,4,5,100]]

^^^ The animals column was omitted instead of inferred.


Solution

  • One option is to set the data type of each known column when you create its array, before building the table. The remaining arrays keep their inferred types, so you do not need to pass a schema at all:

    import pyarrow as pa

    n_legs = pa.array([2, 4, 5, 100], type=pa.int32())
    animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
    pydict = {'n_legs': n_legs, 'animals': animals}
    pa.Table.from_pydict(pydict)
    
    pyarrow.Table
    n_legs: int32
    animals: string
    ----
    n_legs: [[2,4,5,100]]
    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]