apache-spark, pyspark, azure-databricks

Create a parquet file with custom schema


I have a requirement like this:

In Databricks, we are reading a CSV file. This file has multiple columns such as emp_name, emp_salary, joining_date, etc. When we read this file into a DataFrame, all the columns come back as strings.

We have an API which gives us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, and so on.

I have to create a Parquet file with the schema that comes from the API.

How can we do this in Databricks using PySpark?


Solution

  • You can always pass in the schema when reading:

    # Spark accepts a DDL-formatted schema string
    schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
    df = spark.read.csv('input.csv', schema=schema)
    df.printSchema()
    df.show()
    

    The only thing to be careful about is that some of the type strings coming from the API cannot be used directly, e.g., "string(50)" needs to be converted to "string" (see the sketch after the sample input below).

    input.csv:

    "name","123.1234","2022-01-01 10:10:00"