apache-spark, pyspark, azure-databricks

Create a parquet file with custom schema


I have a requirement like this:

In Databricks, we are reading a CSV file. This file has multiple columns such as emp_name, emp_salary, joining_date, etc. When we read this file into a DataFrame, all the columns come back as strings.

We have an API which gives us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, and so on.

I have to create a Parquet file with the schema that comes from the API.

How can we do this in Databricks using PySpark?


Solution

  • You can always pass in the schema when reading:

    # Spark accepts a DDL-formatted schema string
    schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
    df = spark.read.csv('input.csv', schema=schema)
    df.printSchema()
    df.show()
    

    The only thing to be careful about is that some of the type strings coming from the API cannot be used directly, e.g., "string(50)" needs to be converted to "string" (see the sketch after the sample input below).

    input.csv:

    "name","123.1234","2022-01-01 10:10:00"