I am trying to read a Parquet file to get its schema, and then apply that schema to a DataFrame when reading a CSV file.
The files fee.parquet and loan__fee.csv have the same contents in different file formats.
Below is my code. I get an error saying that the schema should be a 'StructType'. How do I convert the schema read from the Parquet file to a StructType?
import pyarrow.parquet

# Read only the schema from the Parquet file (this returns a pyarrow.Schema)
fee_schema = pyarrow.parquet.read_schema("/dbfs/FileStore/fee.parquet", memory_map=True)
# This call fails: Spark expects a StructType or a string, not a pyarrow.Schema
df_mod = spark.read.csv('/FileStore/loan__fee.csv', header="true", schema=fee_schema)
It gives this error:
TypeError: schema should be StructType or string
I tried a few options, such as fee_schema.to_string(show_schema_metadata=True), but that does not work either and gives a ParseError.
Thanks for your time!
As suggested by mck, you can use spark.read.parquet to get the schema. This command only fetches the metadata from the file rather than reading it completely. So you'll have something like this:
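# Spark only needs the Parquet file's metadata to determine the schema
src_df = spark.read.parquet("/FileStore/fee.parquet")
# src_df.schema is already a StructType, so it can be passed directly
df_mod = spark.read.csv('/FileStore/loan__fee.csv', header="true",
                        schema=src_df.schema)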
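If you would rather keep the pyarrow schema, note that the error message also allows a string: Spark accepts a DDL-formatted schema such as "col0 INT, col1 DOUBLE". Below is a minimal sketch of building such a string. The helper name arrow_fields_to_ddl, the mapping table, and the example column names are all hypothetical, and the mapping covers only a few common types. It assumes that str() of a pyarrow type yields names like "int64" or "timestamp[us]", and that column names contain no backticks.

```python
# Hypothetical mapping from pyarrow type names (as produced by str(arrow_type))
# to Spark SQL DDL type names. Illustrative only, not exhaustive.
ARROW_TO_SPARK_DDL = {
    "int8": "TINYINT",
    "int16": "SMALLINT",
    "int32": "INT",
    "int64": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "STRING",
    "bool": "BOOLEAN",
    "binary": "BINARY",
    "date32[day]": "DATE",
}


def arrow_fields_to_ddl(fields):
    """Build a DDL schema string from (column_name, arrow_type_name) pairs."""
    parts = []
    for name, arrow_type in fields:
        if arrow_type.startswith("timestamp"):
            # Any timestamp unit/timezone is mapped to Spark's TIMESTAMP
            spark_type = "TIMESTAMP"
        elif arrow_type in ARROW_TO_SPARK_DDL:
            spark_type = ARROW_TO_SPARK_DDL[arrow_type]
        else:
            # Fail loudly instead of guessing a type for unmapped cases
            raise ValueError(
                f"No mapping for Arrow type {arrow_type!r} (column {name!r})"
            )
        # Backtick-quote the column name so unusual names still parse
        parts.append(f"`{name}` {spark_type}")
    return ", ".join(parts)


# Made-up columns, just to show the shape of the result
ddl = arrow_fields_to_ddl(
    [("fee_id", "int64"), ("amount", "double"), ("created", "timestamp[us]")]
)
print(ddl)
```

With the schema from the question, the pairs could come from zip(fee_schema.names, map(str, fee_schema.types)), and the resulting string could then be passed as schema=ddl to spark.read.csv. Using spark.read.parquet as shown above is still simpler, because Spark does the type mapping itself.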