I'm looking for advice on how to read Parquet files from an HDFS cluster using Apache NiFi. The cluster has multiple files under a single directory, and I want to read all of them in one flow. Does NiFi provide a built-in component for reading the files in an HDFS directory (Parquet in this case)?
Example: three files are present in the directory:
hdfs://app/data/customer/file1.parquet
hdfs://app/data/customer/file2.parquet
hdfs://app/data/customer/file3.parquet
Thanks!
If your requirement is to read the files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. You can use either of two approaches:
1. ListHDFS and FetchHDFS
2. GetHDFS
The difference between the two approaches is that GetHDFS lists the contents of the configured directory again on every run, so it will produce duplicates if the source files are left in place (its Keep Source File property set to true). The ListHDFS and FetchHDFS approach, however, keeps track of state, so only new additions and/or modifications are returned in each subsequent run.
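To make the difference concrete, here is a minimal Python sketch of the two listing behaviours. It is a toy model, not NiFi's actual implementation: the names `HdfsFile`, `stateless_listing` and `StatefulLister` are made up for illustration, and the assumption that state is tracked as "newest modification time seen so far" is a simplification of what ListHDFS stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsFile:
    """Hypothetical stand-in for one entry in an HDFS directory listing."""
    path: str
    modified: int  # modification time, e.g. epoch milliseconds


def stateless_listing(directory):
    """Model of a lister with no memory: every run returns every file."""
    return [f.path for f in directory]


class StatefulLister:
    """Model of a lister that remembers the newest modification time seen
    and only emits files that are newer than that on later runs."""

    def __init__(self):
        self.latest_seen = -1  # nothing listed yet

    def list_new(self, directory):
        fresh = [f for f in directory if f.modified > self.latest_seen]
        if fresh:
            # Persist the high-water mark so the next run skips these files.
            self.latest_seen = max(f.modified for f in fresh)
        return [f.path for f in fresh]


if __name__ == "__main__":
    directory = [
        HdfsFile("hdfs://app/data/customer/file1.parquet", 100),
        HdfsFile("hdfs://app/data/customer/file2.parquet", 200),
        HdfsFile("hdfs://app/data/customer/file3.parquet", 300),
    ]
    lister = StatefulLister()

    # Run each lister twice against the same, unchanged directory.
    for run in (1, 2):
        print(f"run {run} stateless:", stateless_listing(directory))
        print(f"run {run} stateful: ", lister.list_new(directory))
```

On the second run the stateless version returns all three paths again, while the stateful one returns nothing until a new or modified file shows up. In a real flow, ListHDFS would emit one FlowFile per listed file and FetchHDFS would then pull the content for each of them, so all files in the directory are handled by the same flow.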