I hope you can help me because I'm currently stuck.
So, I was given the task of converting several csv files to parquet format. These parquet files will then be used as inputs to another script for processing.
Another engineer had created the conversion script, but my current employer somehow no longer has it, and that engineer has since left the company. So they asked me to write another conversion script.
I tried to create the conversion script, but its output is not accepted by the processing script, which keeps throwing errors. But when I used one of the old parquet files created by the previous engineer's conversion script, the processing script ran perfectly.
I have the data in both csv and parquet format. So my goal is to create a conversion script that, when executed on the csv files, produces the same parquet format.
FYI, the errors thrown by the processing script are mainly because the parquet files produced by my current script do not use the expected data types for the columns.
My question: how can I make my conversion script produce parquet files with the column data types the processing script expects?
Any input will be greatly appreciated!
You need to inspect the schema and the metadata of the old parquet files.
The schema is particularly useful for finding out the expected data types.
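For example, a minimal sketch using pyarrow ('old_file.parquet' stands in for one of the legacy files):

import pyarrow.parquet as pq
# print the column names and Arrow data types of the legacy file
print(pq.read_schema('old_file.parquet'))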
EDIT: Using the pyarrow module:
import pyarrow.parquet as pq
import json
import pandas as pd
# load legacy parquet file
old_tbl = pq.read_table('old_file.parquet')
# get the metadata key
print(old_tbl.schema.metadata.keys())
# let's say the result was b'pandas'...
# create a dictionary with metadata information
old_info = json.loads(old_tbl.schema.metadata[b'pandas'].decode('utf-8'))
# get the metadata field names
print(old_info.keys())
# finally, inspect each metadata field
# e.g. column names and data types
print(pd.DataFrame(old_info['columns']))
# e.g. the pandas version used by the previous engineer
print(old_info['pandas_version'])
# e.g. the library that wrote the file (pyarrow, assuming he used it) and its version
print(old_info['creator'])
# and so on
With all this information, you can create new parquet files that carry the data types the processing script expects.
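If the csv columns have the same names and order as in the legacy files, one possible approach (just a sketch; 'input.csv' and 'new_file.parquet' are placeholder names) is to reuse the legacy schema and cast the new table to it before writing:

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# column names and data types taken from the legacy file
old_schema = pq.read_schema('old_file.parquet')

# read the csv and cast the resulting table to the legacy schema
# (requires matching column names; some columns may need manual conversion
# if pyarrow cannot cast the inferred type directly)
new_tbl = pacsv.read_csv('input.csv')
new_tbl = new_tbl.cast(old_schema)

# write the parquet file with the expected data types
pq.write_table(new_tbl, 'new_file.parquet')

Then feed the resulting file to the processing script and compare its schema against the legacy file's schema if it still complains.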