Tags: apache-spark, pyspark, cloudera

java.io.IOException: Stream is corrupted while writing a Big file in Pyspark


I am reading about 9 million rows from SQL Server and inserting them into an existing Parquet table in my data lake.

This process worked with less data, around 1 million rows.

I am using only the basic read/write for SQL Server:

[screenshot of the basic JDBC read/write code]
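The screenshot doesn't show here, but the read/write is roughly equivalent to this sketch (server, database, table, and output path names are placeholders, not my real connection details):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

# Plain JDBC read: no partitioning options, so Spark issues a single query
# and pulls all ~9 million rows through one connection/task.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .option("dbtable", "dbo.MyTable")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Append into the existing Parquet table in the data lake.
df.write.mode("append").parquet("/datalake/path/to/existing_table")
```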

My Spark Submit looks like this:

[screenshot of the spark-submit command]

My PySpark config:

[screenshot of the PySpark configuration]

I have tried repartitioning and increasing the memory to 15 GB, but I still get the same issue:

java.io.IOException: Stream is corrupted

Sorry, but I don't have access to the full logs.


Solution

  • When you read data this way, only one core is actually used, because the JDBC connector does not parallelize reads unless it is explicitly configured to do so. So most probably the connection times out during the read operation.

    You need to look at the JDBC connector options such as partitionColumn, lowerBound, upperBound, and numPartitions, which split the read into multiple parallel queries (and maybe also look at fetchsize); see the sketch below.
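    As a rough sketch, assuming the source table has an indexed numeric column (here called id) whose range covers the ~9 million rows; the column name, bounds, and partition count are assumptions, not your actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .option("dbtable", "dbo.MyTable")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    # Split the read into 16 parallel queries over the id range,
    # instead of one long-running query on a single core.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "9000000")
    .option("numPartitions", "16")
    # Rows fetched per round trip; a larger value reduces network round trips.
    .option("fetchsize", "10000")
    .load()
)

df.write.mode("append").parquet("/datalake/path/to/existing_table")
```

    With the read split this way, no single connection has to stream the entire table, which makes a mid-read timeout (and the resulting corrupted stream) much less likely.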