
Comparing and Generating Parquet Files in Python


I hope you can help me because I'm currently stuck.

So, I was given the task of converting several csv files to parquet format. These parquet files will be used as inputs to another script for processing.

Another engineer had created a conversion script, but my current employer somehow doesn't have it anymore, and that engineer has since left the company. So they asked me to create a new conversion script.

I tried to write the conversion script, but its output is not accepted by the processing script, which keeps throwing errors. When I use one of the old parquet files created by the previous engineer's script, the processing script runs perfectly.

I have the data in both csv and parquet format. So my goal is to create a conversion script that, when run on the csv files, produces parquet files matching the old format.

My question:

  1. How can I compare the parquet files produced by my current script to the correct parquet files, so I can find out the differences?
  2. Is it possible to reverse engineer the parquet files to find out how the previous script did the conversion?

FYI, the errors thrown by the processing script are mainly because the parquet files produced by my current script do not use the expected datatypes for the columns.

Any inputs will be greatly appreciated!


Solution

  • You need to inspect the schema and the metadata of the parquet files.

    The schema will be particularly useful for finding out the column data types (a schema-diff sketch is included at the end of this answer).

    EDIT: Using the pyarrow module:

    import pyarrow.parquet as pq
    import json
    import pandas as pd
    
    # load legacy parquet file
    old_tbl = pq.read_table('old_file.parquet')
    
    # list the metadata keys
    print(old_tbl.schema.metadata.keys())
    
    # let's say the result was b'pandas'...
    # create a dictionary with metadata information
    old_info = json.loads(old_tbl.schema.metadata[b'pandas'].decode('utf-8'))
    
    # get the metadata field names
    print(old_info.keys())
    
    # finally, inspect each metadata field
    # e.g. the column types
    print(pd.DataFrame(old_info['columns']))
    
    # e.g. the pandas version used to write the file
    print(old_info['pandas_version'])
    
    # e.g. the library (and version) that created the file, assuming pyarrow was used
    print(old_info['creator'])
    # and so on
    

    With all this information, you can create new parquet files that carry the data types the processing script expects; see the two sketches below.
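
    For question 1, the schemas of the two files can also be diffed directly. A minimal sketch, assuming both files are readable with pyarrow; the file names are placeholders:

    import pyarrow.parquet as pq
    
    # read only the schemas, without loading the data
    old_schema = pq.read_schema('old_file.parquet')
    new_schema = pq.read_schema('new_file.parquet')
    
    # quick check: are the two schemas identical?
    print(old_schema.equals(new_schema))
    
    # column-by-column diff of names and types
    old_types = {field.name: field.type for field in old_schema}
    new_types = {field.name: field.type for field in new_schema}
    for name in sorted(set(old_types) | set(new_types)):
        if old_types.get(name) != new_types.get(name):
            print(f"{name}: old={old_types.get(name)}, new={new_types.get(name)}")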
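
    Once the legacy schema is known, one way to enforce it during conversion is to pass it to pyarrow when building the table from the csv data. Another sketch, assuming the csv loads cleanly into pandas; 'data.csv' and the output name are placeholders:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    # reuse the schema of a known-good file as the conversion target
    target_schema = pq.read_schema('old_file.parquet')
    
    # load the csv and cast the columns to the target schema
    df = pd.read_csv('data.csv')
    table = pa.Table.from_pandas(df, schema=target_schema)
    
    # write a parquet file with the expected column types
    pq.write_table(table, 'converted_file.parquet')

    Note that if the legacy schema embeds pandas index columns (e.g. '__index_level_0__'), from_pandas may complain; removing those fields from the target schema before converting is one workaround.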