python, json, parquet, pyarrow

Why can't I parse a timestamp in pyarrow?


I have a JSON file with this field:

"BirthDate":"2022-09-05T08:08:46.000+00:00"

I want to create a Parquet file from that JSON file. I prepared a fixed schema for pyarrow in which BirthDate is a pa.timestamp('s'). When I try to convert the file, I get this error:

ERROR:root:Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000+00:00

My pyarrow code:

parquet_file = pyarrow_json.read_json(json_file, parse_options=pyarrow_json.ParseOptions(
                explicit_schema=prepared_schema,
                unexpected_field_behavior='ignore'))

I also have some files with other timestamp formats (for example, without the "+" offset), and those convert fine.

How can I convert it, and what is the problem with this specific format?


Solution

  • It works for me using pa.field("BirthDate", pa.timestamp('ms')).

    I think it's because your timestamps have millisecond precision (the .000 fractional part), even though the milliseconds are all zero, so they don't fit a column declared as timestamp('s').

    
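    import pyarrow.json as pyarrow_json
    import pyarrow as pa
    
    # Declare BirthDate with millisecond resolution so the ".000" part can be parsed.
    prepared_schema = pa.schema([pa.field("BirthDate", pa.timestamp('ms'))])
    
    # json_file is the path (or file object) of the JSON input from the question.
    # Note that read_json returns an in-memory pyarrow Table, not a Parquet file.
    parquet_file = pyarrow_json.read_json(
        json_file,
        parse_options=pyarrow_json.ParseOptions(
            explicit_schema=prepared_schema,
            unexpected_field_behavior='ignore')
    )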
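
If the schema has to stay at pa.timestamp('s'), another option might be to normalize the strings before pyarrow sees them. Below is a minimal sketch using only the standard library (Python 3.7+). The helper name to_whole_seconds is hypothetical, and the sketch assumes that dropping the sub-second part is acceptable, that offset-aware values should be converted to UTC, and that naive values should be left unshifted:

```python
import json
from datetime import datetime, timezone


def to_whole_seconds(ts: str) -> str:
    """Return an ISO 8601 string with no fractional seconds and no offset.

    Hypothetical helper: offset-aware inputs are converted to UTC first,
    naive inputs are assumed to be in the desired zone already.
    """
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is not None:
        # Shift to UTC, then drop the tzinfo so no "+00:00" suffix is emitted.
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    # Truncate (not round) any sub-second part.
    return dt.replace(microsecond=0).isoformat()


# Example: rewrite one record shaped like the one in the question.
record = json.loads('{"BirthDate": "2022-09-05T08:08:46.000+00:00"}')
record["BirthDate"] = to_whole_seconds(record["BirthDate"])
print(json.dumps(record))  # {"BirthDate": "2022-09-05T08:08:46"}
```

The rewritten records can then be saved as JSON again and read with the original timestamp('s') schema. This relies on pyarrow accepting the offset-free form for timestamp('s'), which matches the question's observation that files without the "+" convert fine. If the input really does contain non-zero milliseconds, keeping timestamp('ms') as in the answer above avoids losing them.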