I've been searching for information on this for a long time without success. I'm starting to think it can't be done when the .parquet files are in Azure Data Lake Storage.
I have a folder with subfolders in Azure Data Lake Storage. These subfolders contain many .parquet files. I manage to retrieve them using the ListAzureDataLakeStorage + FetchAzureDataLakeStorage combination. Then I try to pass them through a PutDatabaseRecord (which I think is the correct processor for loading them into the DB).
I think I have the PutDatabaseRecord configured correctly. But when I run it, I get this error: "Failed to process session due to Failed to process StandardFlowFileRecord due to java.lang.NullPointerException: Name is null".
I'm not sure I'm using the PutDatabaseRecord correctly. I thought that PutDatabaseRecord read the incoming FlowFiles and interpreted their content as .parquet (it is supposed to use a ParquetReader as its Record Reader), so that it could understand the data as records. But it surprises me that I don't have to indicate how to interpret the .parquet, or how to map its columns to those of the DB table. Or does it not work the way I think, and the FlowFile content must already arrive as records?
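As a rough illustration of the record idea (a sketch in plain Python, not NiFi's implementation): a record can be pictured as one row with named fields, and a Parquet file carries its own schema, which is why a reader can produce records without extra mapping configuration. As far as I understand, PutDatabaseRecord then matches field names against the table's column names. The table `sales`, the sample fields, and the helper `insert_records` below are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical records, as a record reader might produce them from one file:
# each record is a set of named fields.
records = [
    {"id": 1, "name": "alpha", "amount": 10.5},
    {"id": 2, "name": "beta", "amount": 7.0},
]


def insert_records(conn, table, records):
    """Insert each record by matching its field names to column names."""
    # Column names that actually exist in the target table.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    inserted = 0
    for record in records:
        # Keep only the fields that have a same-named column.
        fields = [c for c in columns if c in record]
        placeholders = ", ".join("?" for _ in fields)
        sql = f"INSERT INTO {table} ({', '.join(fields)}) VALUES ({placeholders})"
        conn.execute(sql, [record[f] for f in fields])
        inserted += 1
    return inserted


# In-memory database standing in for the real target DB (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
count = insert_records(conn, "sales", records)
rows = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

The point of the sketch is only that the mapping is driven by names: if the field names in the file and the column names in the table line up, no explicit mapping step is needed.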
Honestly, I can't explain myself any better because I don't really understand what counts as a record in NiFi, or how a record relates to reading a .parquet file.
Either I am missing a processor or I have configured something wrong. The only thing I can find is FetchParquet, which seems to be able to read a .parquet file and write it into the FlowFile as records. However, it can only be used with ListHDFS or ListFile, which do not let me fetch data from Azure Data Lake Storage.
After several tests with the ConvertRecord and QueryRecord processors, I have concluded that the problem lies in how the ParquetReader reads the content of the incoming FlowFiles, because every processor that needs a ParquetReader gives the same error. I downloaded the content of a FlowFile entering the processor that uses the ParquetReader (whichever one it is), opened it in a .parquet viewer, and verified that the content is fine. Not knowing what else to do, I have attached a screenshot of the specific error. I still don't know what "Name" the error refers to.
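A quick way to sanity-check downloaded FlowFile content without a viewer is to look at the Parquet magic bytes: an unencrypted Parquet file starts and ends with the 4 bytes `PAR1`. The following is a minimal Python sketch under that assumption. The function name `looks_like_parquet` is hypothetical, and passing this check only shows that the file is framed correctly, not that every row group is readable.

```python
def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start and end with the Parquet magic 'PAR1'."""
    magic = b"PAR1"
    # Smallest possible layout: header magic + 4-byte footer length + footer magic.
    if len(data) < 12:
        return False
    return data[:4] == magic and data[-4:] == magic
```

To use it, read the downloaded content in binary mode, for example `looks_like_parquet(open(path, "rb").read())`, where `path` is wherever the FlowFile content was saved. A `False` result would suggest the content was truncated or altered before reaching the reader.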
Note: I also posted my problem on the Cloudera community, where it is perhaps better explained. Here is the link in case someone wants to look at it: https://community.cloudera.com/t5/Support-Questions/How-can-I-dump-the-parquet-data-that-is-in-Azure/td-p/316020
In the end, the closest match I found for the error I was getting is this issue: https://issues.apache.org/jira/browse/NIFI-7817. It appears to be a bug related to the creation of the ParquetReader. This makes sense, because it would affect any processor that uses a ParquetReader. In addition, the FlowFiles were not even entering the processor that used it.
I was using NiFi version 1.12.1. I have downloaded version 1.13.2 and it no longer gives the "Name is null" error. In addition, the FlowFiles now enter the processor. On the download page for the new version (https://nifi.apache.org/download.html) you can find the Release Notes and the Migration Guidance, which describe what has been fixed since previous versions and which processors need care when migrating.
However, even though the data now goes into the processor, I still get an error. It is a different one, so I will ask about it in a separate post.