Tags: pyspark, parquet

Parquet read from HDFS and Schema issue


When I try to read a Parquet file from HDFS, the schema comes back with mixed-case column names. Is there any way to convert them all to lower case?

df=spark.read.parquet(hdfs_location)

df.printSchema()
root
|-- RecordType: string (nullable = true)
|-- InvestmtAccnt: string (nullable = true)
|-- InvestmentAccntId: string (nullable = true)
|-- FinanceSummaryID: string (nullable = true)
|-- BusinDate: string (nullable = true)

What I need is the following:


root
|-- recordtype: string (nullable = true)
|-- investmtaccnt: string (nullable = true)
|-- investmentaccntid: string (nullable = true)
|-- financesummaryid: string (nullable = true)
|-- busindate: string (nullable = true)

Solution

  • First, read the Parquet files:

    df=spark.read.parquet(hdfs_location)
    

    Then use the toDF function to create a DataFrame with all-lowercase column names (a complete end-to-end sketch follows below):

    df=df.toDF(*[c.lower() for c in df.columns])
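
Putting the two steps together, here is a minimal, self-contained sketch of the whole flow. The HDFS path and application name are placeholders, not values from the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lowercase-columns").getOrCreate()

    # Hypothetical HDFS location; replace with the actual path to the Parquet files.
    hdfs_location = "hdfs:///data/finance_summary"

    # Read the Parquet files; column names come back exactly as stored (mixed case).
    df = spark.read.parquet(hdfs_location)

    # toDF(*names) returns a new DataFrame with the columns renamed positionally,
    # so lowercasing every existing name rewrites the whole schema in one pass.
    df = df.toDF(*[c.lower() for c in df.columns])

    df.printSchema()
    # root
    #  |-- recordtype: string (nullable = true)
    #  |-- investmtaccnt: string (nullable = true)
    #  ...

An equivalent alternative is df.select([col(c).alias(c.lower()) for c in df.columns]) (with col imported from pyspark.sql.functions), which performs the same renaming via explicit aliases; toDF is simply the more concise option when every column is being renamed.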