apache-spark · pyspark · databricks · azure-databricks · pyarrow

get schema for Parquet file in StructType format


I am trying to read a parquet file to save its schema, and then use that schema to assign it to a dataframe while reading the csv file.

The files fee.parquet and loan__fee.csv have the same contents in different file formats.

Below is my code. I get an error that the schema should be 'StructType'. How do I convert the schema read from the parquet file to a StructType?

from pyarrow.parquet import ParquetFile
import pyarrow.parquet
fee_schema = pyarrow.parquet.read_schema("/dbfs/FileStore/fee.parquet", memory_map=True)

df_mod = spark.read.csv('/FileStore/loan__fee.csv', header="true", schema=fee_schema)

It gives this error:

TypeError: schema should be StructType or string

I tried a few options such as fee_schema.to_string(show_schema_metadata=True), but that does not work either; it gives a ParseError.

Thanks for your time!


Solution

  • As suggested by mck, you can use spark.read.parquet to get the schema; this command only fetches the metadata from the file rather than reading it completely. So you'll have something like this:

    src_df = spark.read.parquet("/FileStore/fee.parquet")
    df_mod = spark.read.csv('/FileStore/loan__fee.csv', header="true", 
        schema=src_df.schema)
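
Alternatively, since the error message says the schema may be a "StructType or string", you could build a DDL-formatted schema string from the pyarrow schema you already read. Below is a minimal sketch of that idea; the `PYARROW_TO_SPARK` mapping and the `arrow_schema_to_ddl` helper are hypothetical names I introduce here, and the mapping covers only a handful of common types:

```python
# Hypothetical sketch: turn a pyarrow schema into a Spark DDL string,
# which spark.read.csv accepts in place of a StructType.
# The type mapping is an assumption and covers only a few common types.
PYARROW_TO_SPARK = {
    "int32": "INT",
    "int64": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "STRING",
    "bool": "BOOLEAN",
}

def arrow_schema_to_ddl(fields):
    """fields: iterable of (name, pyarrow_type_name) pairs, e.g.
    [(f.name, str(f.type)) for f in fee_schema] for a pyarrow schema."""
    return ", ".join(f"{name} {PYARROW_TO_SPARK[type_name]}"
                     for name, type_name in fields)
```

You would then pass the result directly, e.g. `spark.read.csv('/FileStore/loan__fee.csv', header="true", schema=arrow_schema_to_ddl(...))`, though reading the schema via `spark.read.parquet` as shown above is the simpler route.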