I am processing CSV files like this:
df = spark.read.csv(path = '/mycsv.csv', header = True)
then saving to a database:
%sql
CREATE DATABASE IF NOT EXISTS MY_DB
and
df.write.saveAsTable("MY_DB.mycsv")
It works fine.
Now, for a parquet file, I am doing roughly the same:
df = spark.read.format("parquet").load('/sample.parquet')  # parquet stores its own schema; no header option
then
df.write.saveAsTable("MY_DB.sample")
and it's giving me this error:
AnalysisException:
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your
schema.
Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'.
For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping
Or you can use alias to rename it.
What does it mean?
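For reference, the character set in the message is literal: Delta rejects column names containing any of " ,;{}()\n\t=" (space, comma, semicolon, braces, parentheses, newline, tab, equals). A small pure-Python check (a hypothetical helper, not a Spark API) can flag the offending columns before writing:

```python
# Characters Delta rejects in column names, per the error message.
INVALID = set(' ,;{}()\n\t=')

def invalid_columns(columns):
    """Return the column names Delta would reject, mapped to their bad characters."""
    return {c: sorted(ch for ch in set(c) if ch in INVALID)
            for c in columns if set(c) & INVALID}

# Run it against df.columns; shown here with literal names for illustration:
print(invalid_columns(["Region", "Item Type", "Order ID", "Total(Profit)"]))
# {'Item Type': [' '], 'Order ID': [' '], 'Total(Profit)': ['(', ')']}
```

The message offers two ways out: either set the table property `delta.columnMapping.mode` to `'name'` so the table keeps the original names, or rename the columns with aliases before writing.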
UPDATE
printSchema() on the parquet DataFrame shows:
root
|-- Region: string (nullable = true)
|-- Country: string (nullable = true)
|-- Item Type: string (nullable = true)
|-- Sales Channel/test: string (nullable = true)
|-- Order Priority: string (nullable = true)
|-- Order Date: date (nullable = true)
|-- Order ID: integer (nullable = true)
|-- Ship Date: date (nullable = true)
|-- Units Sold: integer (nullable = true)
|-- Unit Price: double (nullable = true)
|-- Unit Cost: double (nullable = true)
|-- Total Revenue: double (nullable = true)
|-- Total Cost: double (nullable = true)
|-- Total Profit: double (nullable = true)
and for the CSV DataFrame it shows (the CSV and parquet files are different datasets):
root
|-- HashKey: string (nullable = true)
|-- GLKey: string (nullable = true)
|-- AccountingDateKey: string (nullable = true)
|-- MainAccountKey: string (nullable = true)
|-- LocationKey: string (nullable = true)
|-- BusinessUnitKey: string (nullable = true)
|-- DepartmentKey: string (nullable = true)
|-- CompanyKey: string (nullable = true)
|-- FinancialHierarchyKey: string (nullable = true)
|-- FinancialSLIDKey: string (nullable = true)
|-- FinancialTaxKey: string (nullable = true)
|-- FinancialPayrollKey: string (nullable = true)
|-- FinancialCustomerKey: string (nullable = true)
|-- FinancialVendorKey: string (nullable = true)
|-- FinancialBankKey: string (nullable = true)
|-- FinancialInventoryKey: string (nullable = true)
|-- FinancialIntangiblesKey: string (nullable = true)
|-- FinancialBankSubKey: string (nullable = true)
|-- DimGLKey: string (nullable = true)
|-- DWCreatedDateTime: string (nullable = true)
The issue was the spaces in the column names. Using select with a list comprehension to replace spaces with underscores fixed it; a column like Item Type is now returned as Item_Type:
from pyspark.sql import functions as F
renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
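The space-only replace is enough for this dataset, since spaces are the only offending characters in the schema above. A sketch that covers every character listed in the error message, assuming an underscore is an acceptable replacement (sanitize is a hypothetical helper name, not a Spark API):

```python
import re

# Match any character Delta rejects in column names: " ,;{}()\n\t=".
INVALID_CHARS = re.compile(r'[ ,;{}()\n\t=]')

def sanitize(name: str) -> str:
    """Replace each character Delta rejects with an underscore."""
    return INVALID_CHARS.sub('_', name)

# Applied the same way as the space-only version:
# renamed_df = df.select([F.col(c).alias(sanitize(c)) for c in df.columns])
# renamed_df.write.saveAsTable("MY_DB.sample")
print(sanitize("Item Type"))  # Item_Type
```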