Tags: azure-synapse, azure-data-lake, mssparkutils

Copying a db file from a temp directory to Azure Data Lake with mssparkutils.fs.cp causes a checksum error in Azure Synapse


I have a temp directory (tempfile.mkdtemp()) where I make edits to a db file using sqlite3 in an Azure Synapse notebook. When I try to copy the finished db file to mounted data lake storage like so:

    mssparkutils.fs.cp('file:' + dirpath + '/example_database.db', 'synfs:/' + x + '/container/example_directory/example_database.db')

where x = mssparkutils.env.getJobId(), I receive this error:

Py4JJavaError: An error occurred while calling z:mssparkutils.fs.cp.
: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/tmp/tmpcem75tu1/InventoryForecasting.db at 0
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:264)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:300)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:252)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:197)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:415)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at com.microsoft.spark.notebook.msutils.impl.MSFsUtilsImpl.cp(MSFsUtilsImpl.scala:247)
    at mssparkutils.fs$.cp(fs.scala:17)
    at mssparkutils.fs.cp(fs.scala)
    at sun.reflect.GeneratedMethodAccessor162.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
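
For reference, a minimal, self-contained sketch of the setup described above; the sqlite edit, table name, and container path are illustrative, not taken from the question:

    import sqlite3
    import tempfile

    from notebookutils import mssparkutils  # available by default in a Synapse notebook

    # Create a local scratch directory and make an edit to a SQLite db in it
    dirpath = tempfile.mkdtemp()
    conn = sqlite3.connect(dirpath + '/example_database.db')
    conn.execute('CREATE TABLE IF NOT EXISTS example (id INTEGER)')  # illustrative edit
    conn.commit()
    conn.close()

    # Copy the finished db file to the mounted data lake path -- this is the call that fails
    x = mssparkutils.env.getJobId()
    mssparkutils.fs.cp('file:' + dirpath + '/example_database.db',
                       'synfs:/' + x + '/container/example_directory/example_database.db')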

I expected the file to copy, and the same method worked fine with both a .txt and an .xlsx file. I can also get the file to copy as expected using adlfs with fsspec, though that requires put() rather than copy() because I'm copying from local storage to remote storage. There's also no problem copying a db file from remote storage to either local or remote storage, so I think the issue is specific to using mssparkutils to copy a db file out of this temp directory.
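
For reference, a sketch of the adlfs/fsspec route mentioned above; the account name and key are placeholders, and dirpath is the temp directory from the question:

    import adlfs

    # Connect to the storage account behind the mounted lake (credentials are placeholders)
    fs = adlfs.AzureBlobFileSystem(account_name='<storage-account>', account_key='<key>')

    # put() uploads local -> remote; copy() is remote -> remote, hence put() is needed here
    fs.put(dirpath + '/example_database.db',
           'container/example_directory/example_database.db')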


Solution

  • I had the same issue with the same use case: I was getting a checksum error when copying some large CSV files from /mnt/temp to ADLS using mssparkutils.fs.cp(), but I had no issue when copying binary files without an extension.

    My workaround was to rename the file so that Synapse treats the CSV file the same way as a binary file:

        # Drop the .csv extension locally so the file is handled like an extensionless binary
        !mv /mnt/temp/FILENAME.csv /mnt/temp/FILENAME
        # Copy to ADLS, restoring the extension on the destination path
        mssparkutils.fs.cp("file:/mnt/temp/FILENAME", "abfss://****/FILENAME.csv")
    

    This works well for me; you may want to try the same approach.
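
    Adapted to the question's temp-directory db file, the same trick would look roughly like this (dirpath and x are the variables from the question):

        import os

        # Strip the extension locally so the copy behaves like the extensionless binary case
        os.rename(dirpath + '/example_database.db', dirpath + '/example_database')

        # Copy to the mounted lake, restoring the .db extension on the destination
        mssparkutils.fs.cp('file:' + dirpath + '/example_database',
                           'synfs:/' + x + '/container/example_directory/example_database.db')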