Tags: apache-spark, pyspark, cloudera

java.io.IOException: Stream is corrupted while writing a Big file in Pyspark


I am reading about 9 million rows from SQL Server and inserting them into an existing Parquet table in my data lake.

This process worked with less data, around 1 million rows.

I am using only the basic read/write for SQL Server:

[screenshot of the basic JDBC read/write code]
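The screenshot doesn't show here, but the read/write is roughly equivalent to this sketch (server, database, table, and output path names are placeholders, not my real connection details):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

# Plain JDBC read: no partitioning options, so Spark issues a single query
# and pulls all ~9 million rows through one connection/task.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .option("dbtable", "dbo.MyTable")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Append into the existing Parquet table in the data lake.
df.write.mode("append").parquet("/datalake/path/to/existing_table")
```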

My Spark Submit looks like this:

[screenshot of the spark-submit command]

My PySpark config:

[screenshot of the PySpark configuration]

I have tried repartitioning and increasing the memory to 15 GB, but I still get the same issue:

java.io.IOException: Stream is corrupted

Sorry, but I don't have access to the full logs.


Solution

  • When you read data this way, only one core is actually used, because the JDBC connector does not parallelize reads unless it is explicitly configured to do so. So most probably the connection times out during the read operation.

    You need to look at the JDBC connector options such as partitionColumn, lowerBound, upperBound, and numPartitions, which split the read into multiple parallel queries (and maybe also look at fetchsize); see the sketch below.
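    As a rough sketch, assuming the source table has an indexed numeric column (here called id) whose range covers the ~9 million rows; the column name, bounds, and partition count are assumptions, not your actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .option("dbtable", "dbo.MyTable")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    # Split the read into 16 parallel queries over the id range,
    # instead of one long-running query on a single core.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "9000000")
    .option("numPartitions", "16")
    # Rows fetched per round trip; a larger value reduces network round trips.
    .option("fetchsize", "10000")
    .load()
)

df.write.mode("append").parquet("/datalake/path/to/existing_table")
```

    With the read split this way, no single connection has to stream the entire table, which makes a mid-read timeout (and the resulting corrupted stream) much less likely.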