Tags: azure-data-factory, azure-storage

Error when viewing parquet file in Azure Data Factory


I have recently been trying to create a metadata-driven pipeline with NYC taxi data in ADF, but the process fails for every file except one. With that one file alone, I can also 'preview data' within ADF. For all other files in the same container and directory, I get the following error: An error occurred when invoking java, message:

java.lang.NoClassDefFoundError:Could not initialize class com.github.luben.zstd.RecyclingBufferPool
total entry:19
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:90)
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:112)
org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readDictionaryPage(ColumnChunkPageReadStore.java:236)
org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:410)
org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.<init>(ParquetBatchReaderBridge.java:70)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:64)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)

I cannot find any knowledge base article online that identifies the source of this error, nor anyone else experiencing a similar issue.

Since I can read and copy only one of the files, I have a sense this might be a permission or policy issue, but I am quite a newbie in Azure and not sure what to look into if it's not the permissions.

In the meantime, I will continue to read about permissions and policies pertaining to containers/directories and files, but I might just be looking in the wrong direction.

I would appreciate any suggestions about what I should check, or where I could continue researching to understand the issue.

What I have tried so far:

  • Isolating the files one by one: only one out of six files can be read ('preview data') and also fed into the pipeline that copies the data from ADLS to an Azure SQL database. (To compare the working file against the failing ones, see the diagnostic sketch after this list.)
  • Checking whether file size was the cause: smaller files, and files of a similar size, return the same error.
  • Changing the authentication method from account key to system-assigned managed identity; the error remains.
  • Giving the Data Factory resource the Storage Account Contributor role within the IAM settings of the storage account.
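One way to compare the working file against the failing ones is to inspect each file's compression codec locally. A minimal diagnostic sketch, assuming pyarrow is installed, the files have been downloaded to the current directory, and a hypothetical yellow_tripdata_*.parquet naming pattern:

from pathlib import Path

import pyarrow.parquet as pq

# Print the compression codec recorded in each file's footer. The zstd
# classes in the stack trace suggest the failing files are ZSTD-compressed,
# while the one readable file likely uses a different codec such as SNAPPY.
for path in sorted(Path(".").glob("yellow_tripdata_*.parquet")):
    meta = pq.ParquetFile(path).metadata
    # The codec is stored per column chunk; checking the first column of
    # the first row group is enough for files written with a single codec.
    print(f"{path.name}: {meta.row_group(0).column(0).compression}")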

This is a second error I also get when selecting other files; reading through it suggests there is some restriction from an authorization or security mechanism:

An error occurred when invoking java, message: java.lang.UnsatisfiedLinkError:D:\Users\_azbatchtask_1\AppData\Local\Temp\libzstd-jni-1.5.5-59177584671814094752.dll: Your organization used Device Guard to block this app. Contact your support person for more info
no zstd-jni-1.5.5-5 in java.library.path
Unsupported OS/arch, cannot find /win/amd64/libzstd-jni-1.5.5-5.dll or load zstd-jni-1.5.5-5 from system libraries. Please try building from source the jar or providing libzstd-jni-1.5.5-5 in your system.
total entry:30
java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
java.lang.Runtime.loadLibrary0(Runtime.java:870)
java.lang.System.loadLibrary(System.java:1122)
com.github.luben.zstd.util.Native$1.run(Native.java:69)
com.github.luben.zstd.util.Native$1.run(Native.java:67)
java.security.AccessController.doPrivileged(Native Method)
com.github.luben.zstd.util.Native.loadLibrary(Native.java:67)
com.github.luben.zstd.util.Native.load(Native.java:154)
com.github.luben.zstd.util.Native.load(Native.java:85)
com.github.luben.zstd.ZstdOutputStreamNoFinalizer.<clinit>(ZstdOutputStreamNoFinalizer.java:18)
com.github.luben.zstd.RecyclingBufferPool.<clinit>(RecyclingBufferPool.java:18)
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:90)
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:112)
org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readDictionaryPage(ColumnChunkPageReadStore.java:236)
org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:410)
org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.<init>(ParquetBatchReaderBridge.java:70)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:64)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)

Here is a view of the schema of the files:

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "VendorID",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "tpep_pickup_datetime",
    "type" : [ "null", {
      "type" : "long",
      "logicalType" : "local-timestamp-micros"
    } ],
    "default" : null
  }, {
    "name" : "tpep_dropoff_datetime",
    "type" : [ "null", {
      "type" : "long",
      "logicalType" : "local-timestamp-micros"
    } ],
    "default" : null
  }, {
    "name" : "passenger_count",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "trip_distance",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "RatecodeID",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "store_and_fwd_flag",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "PULocationID",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "DOLocationID",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "payment_type",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "fare_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "extra",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "mta_tax",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "tip_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "tolls_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "improvement_surcharge",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "total_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "congestion_surcharge",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "Airport_fee",
    "type" : [ "null", "double" ],
    "default" : null
  } ]
}
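For reference, the same schema can be inspected locally. A minimal sketch, assuming pyarrow and a hypothetical file name (pyarrow prints its own schema notation rather than the Avro JSON above):

import pyarrow.parquet as pq

# Read only the footer and print the column names and types; useful for
# confirming that all six files share the schema shown above.
print(pq.read_schema("yellow_tripdata_2023-01.parquet"))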

Below I've attached screenshots of the NYC Yellow Taxi data dictionary and a screenshot of some files I've downloaded to test the ADF pipeline.

[Screenshot: Data Dictionary - NYC Yellow Taxi Data]

[Screenshot: NYC Yellow Taxi - files]


Solution

  • The error reports a missing dependency class when you preview data in the dataset or run the copy activity.

    Some of the parquet files (here, the ZSTD-compressed ones, per the zstd-jni references in the stack traces) require additional libraries to be read successfully. Use a data flow instead: it reads the data on a Spark cluster, which includes all the required dependencies.
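    To confirm the files themselves are intact and the failure is specific to the preview/copy runtime, you can read one of the failing files locally. A minimal sketch, assuming pyarrow (which bundles ZSTD support) and a hypothetical file name:

    import pyarrow.parquet as pq

    # pyarrow ships with a built-in ZSTD codec, so if this read succeeds,
    # the parquet file is fine and the failure lies in the ADF runtime's
    # blocked/missing zstd-jni native library, not in the data.
    table = pq.read_table("yellow_tripdata_2023-01.parquet")
    print(table.num_rows, "rows;", table.schema.names[:5])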

    I got the same error for the NYC trip data, but when I tried a data flow it read successfully.

    So, while creating the dataset, set the Import schema option to None and save it.

    [Screenshot: dataset created with Import schema set to None]

    Next, use this dataset in the data flow's source settings.

    Output:

    [Screenshot: data flow data preview showing the file read successfully]
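    If a data flow is not an option and you control the files, another route is to rewrite them with a codec the copy activity's runtime handles natively, such as SNAPPY. A minimal sketch under the same pyarrow assumption, with hypothetical file names:

    import pyarrow.parquet as pq

    # Re-encode a ZSTD-compressed file with SNAPPY so that dataset preview
    # and the plain copy activity, which lack the zstd-jni native library,
    # can read it without a Spark-backed data flow.
    table = pq.read_table("yellow_tripdata_2023-01.parquet")
    pq.write_table(table, "yellow_tripdata_2023-01_snappy.parquet",
                   compression="snappy")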