Tags: python, pyspark, pycharm

PySpark on Windows: error while writing DataFrame to CSV


I am trying to set up a local development environment for PySpark on a Windows 10 machine with PyCharm. So far I am able to read from various sources and do transformations, but when I try to write the transformed data to the local file system using df.write() it fails with the error below.

I have tried various answers on this topic, but they all feel like shooting in the dark, because what worked for one user did not work for another. I have winutils.exe and hadoop.dll in their respective folders. Any help understanding and fixing this issue would be great.
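
A quick sanity check along these lines (a sketch only, assuming HADOOP_HOME points at the Hadoop folder; adjust the paths to your layout) confirms the binaries are where Spark looks for them:

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
print("HADOOP_HOME =", hadoop_home)
# winutils.exe and hadoop.dll are expected under %HADOOP_HOME%\bin
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(hadoop_home, "bin", name)
    print(path, "->", "found" if os.path.exists(path) else "MISSING")
# hadoop.dll is often also copied into System32
print(r"C:\Windows\System32\hadoop.dll ->",
      "found" if os.path.exists(r"C:\Windows\System32\hadoop.dll") else "MISSING")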

This error is reproducible on my machine using the code below; I checked in the pyspark shell and I get the same error there:

from pyspark.sql.types import IntegerType

# `spark` is the SparkSession provided by the pyspark shell
my_list = [1, 2, 3]
df = spark.createDataFrame(my_list, IntegerType())
df.show()
df.write.csv("mypath")

This code is able to show the DataFrame and create a directory at the write path, but it does not write anything into it.
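
A successful write should leave part-*.csv files (and a _SUCCESS marker) inside that directory. A quick check along these lines, using the same "mypath" output directory as above, confirms nothing was written:

import os

out_dir = "mypath"  # the path passed to df.write.csv above
if os.path.isdir(out_dir):
    contents = os.listdir(out_dir)
    print("contents:", contents)
    print("has part files:", any(f.startswith("part-") for f in contents))
else:
    print("output directory was not created")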

Loading target table
Traceback (most recent call last):
  File "E:\pyspark_boilerlpat_beginners\pipeline.py", line 35, in <module>
    pipeline.run_pipeline()
  File "E:\pyspark_boilerlpat_beginners\pipeline.py", line 25, in run_pipeline
    load_process.load_target(transformed_df)
  File "E:\pyspark_boilerlpat_beginners\load.py", line 17, in load_target
    df.write.partitionBy("workclass", "race", "sex").mode("Overwrite").option("header", "true").csv("./_data/transformed_salary_csv/")
  File "C:\spark3\python\pyspark\sql\readwriter.py", line 1372, in csv
    self._jwrite.csv(path)
  File "C:\Python\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "C:\spark3\python\pyspark\sql\utils.py", line 111, in deco
    return f(*a, **kw)
  File "C:\Python\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o41.csv.
: ExitCodeException exitCode=-1073741515: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
    at org.apache.hadoop.util.Shell.run(Shell.java:901)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:547)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:587)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:586)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:586)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
    at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:354)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:178)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:979)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

I have tried the following suggestions from other answers, and none of them worked:

  1. Writing a smaller amount of data
  2. Placing hadoop.dll in Windows\System32
  3. Replacing winutils.exe in case it was corrupted
  4. Verifying that the Hadoop and Spark paths are set properly
  5. Verifying that the TEMP and TMP paths match the system settings
  6. Updating Microsoft Visual C++ and installing the x86 version, with no luck so far

Solution

  • Finally I found the issue, after losing precious hours. Windows machines are a real headache; on a Mac this works like a charm.
    The root cause is a Windows error: "The program can't start because MSVCP100.dll is missing from the computer. Reinstalling the program will fix this problem." The exit code in the traceback, -1073741515, is 0xC0000135 (STATUS_DLL_NOT_FOUND), which means the Hadoop native libraries could not load a DLL they depend on.

    I needed to install the 2010 version of the VC++ redistributable package.

    Download the Microsoft Visual C++ 2010 Service Pack 1 Redistributable Package from the official Microsoft Download Center. Installing the x64 redistributable (vcredist_x64.exe) from there resolved the issue.

    https://www.microsoft.com/en-au/download/details.aspx?id=26999
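
    To verify the fix, a rough check along these lines (a sketch only; the master, app name, and output path are placeholders) confirms the VC++ 2010 runtime DLLs now resolve from System32 and that a minimal write goes through:

    import os
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    # runtime DLLs installed by the VC++ 2010 redistributable
    for dll in ("msvcp100.dll", "msvcr100.dll"):
        path = os.path.join(r"C:\Windows\System32", dll)
        print(dll, "->", "found" if os.path.exists(path) else "MISSING")

    spark = SparkSession.builder.master("local[*]").appName("csv-write-check").getOrCreate()
    df = spark.createDataFrame([1, 2, 3], IntegerType())
    df.write.mode("overwrite").csv("mypath")  # should now produce part-*.csv files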