
Processing a parquet file in PySpark gives an error on saving


I am processing CSV files like this:

df = spark.read.csv(path = '/mycsv.csv', header = True)

then saving to a database:

%sql
CREATE DATABASE IF NOT EXISTS MY_DB

and

df.write.saveAsTable("MY_DB.mycsv")

This works fine.

Now, in the case of a parquet file, I am doing roughly the same:

df = spark.read.format("parquet").load(path = '/sample.parquet', header = True)

then

df.write.saveAsTable("MY_DB.sample")

and it gives me this error:

AnalysisException

AnalysisException: 
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your
schema. 
Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'.

For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping
Or you can use alias to rename it.

What does it mean?

UPDATE

Printing the schema of the parquet file shows:

root
 |-- Region: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Item Type: string (nullable = true)
 |-- Sales Channel/test: string (nullable = true)
 |-- Order Priority: string (nullable = true)
 |-- Order Date: date (nullable = true)
 |-- Order ID: integer (nullable = true)
 |-- Ship Date: date (nullable = true)
 |-- Units Sold: integer (nullable = true)
 |-- Unit Price: double (nullable = true)
 |-- Unit Cost: double (nullable = true)
 |-- Total Revenue: double (nullable = true)
 |-- Total Cost: double (nullable = true)
 |-- Total Profit: double (nullable = true)
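The schema above shows column names containing spaces (e.g. Item Type, Order ID), which are among the characters Delta rejects. A minimal sketch of how to spot the offending columns, using a hypothetical column list mirroring the schema above (plain Python, no Spark session needed):

```python
# Characters listed as invalid in the AnalysisException message
invalid = set(' ,;{}()\n\t=')

# Hypothetical column list mirroring the parquet schema above
columns = ["Region", "Country", "Item Type", "Sales Channel/test", "Order ID"]

# Keep any column name that contains at least one invalid character
bad_cols = [c for c in columns if any(ch in invalid for ch in c)]
print(bad_cols)  # ['Item Type', 'Sales Channel/test', 'Order ID']
```

With a real DataFrame, the same check can be run against `df.columns`. Note that "Sales Channel/test" is flagged because of its space, not the slash, which is not in the invalid set.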

and for the CSV file it shows (the CSV and the parquet are different files):

root
 |-- HashKey: string (nullable = true)
 |-- GLKey: string (nullable = true)
 |-- AccountingDateKey: string (nullable = true)
 |-- MainAccountKey: string (nullable = true)
 |-- LocationKey: string (nullable = true)
 |-- BusinessUnitKey: string (nullable = true)
 |-- DepartmentKey: string (nullable = true)
 |-- CompanyKey: string (nullable = true)
 |-- FinancialHierarchyKey: string (nullable = true)
 |-- FinancialSLIDKey: string (nullable = true)
 |-- FinancialTaxKey: string (nullable = true)
 |-- FinancialPayrollKey: string (nullable = true)
 |-- FinancialCustomerKey: string (nullable = true)
 |-- FinancialVendorKey: string (nullable = true)
 |-- FinancialBankKey: string (nullable = true)
 |-- FinancialInventoryKey: string (nullable = true)
 |-- FinancialIntangiblesKey: string (nullable = true)
 |-- FinancialBankSubKey: string (nullable = true)
 |-- DimGLKey: string (nullable = true)
 |-- DWCreatedDateTime: string (nullable = true)

Solution

  • The issue is with the column names, which contain spaces. Using select in conjunction with a list comprehension to replace spaces with underscores, a column such as Item Type is renamed to Item_Type:

    from pyspark.sql import functions as F
    
    renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
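If column names might also contain the other characters the error message lists (commas, semicolons, braces, etc.), a more general sketch is to strip all of them with a regex. This is a hypothetical helper, not part of the original answer:

```python
import re

def sanitize(name):
    # Replace every character Delta rejects (from the error message),
    # not just spaces, with an underscore
    return re.sub(r'[ ,;{}()\n\t=]', '_', name)

print(sanitize("Item Type"))           # Item_Type
print(sanitize("Sales Channel/test"))  # Sales_Channel/test (slash is allowed)
```

It would be applied the same way as above: `df.select([F.col(c).alias(sanitize(c)) for c in df.columns])`. Alternatively, the error message itself suggests setting the table property `delta.columnMapping.mode` to `name`, which lets Delta keep the original names.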