Read Parquet Files from HDFS cluster


I'm looking for advice on how to read Parquet files from an HDFS cluster using Apache NiFi. There are multiple files under a single directory in the cluster, and I want to read all of them in one flow. Does NiFi provide a built-in component for reading the files in an HDFS directory (Parquet files in this case)?

Example: 3 files are present in the directory:

hdfs://app/data/customer/file1.parquet

hdfs://app/data/customer/file2.parquet

hdfs://app/data/customer/file3.parquet

Thanks!


Solution

  • If your requirement is to read the files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. There are two approaches to choose from:

    • A combination of ListHDFS and FetchHDFS
    • GetHDFS

    The difference is that GetHDFS keeps no record of what it has already fetched: on every run it lists the configured directory again, so any files still present there are picked up again and you get duplicates. The ListHDFS and FetchHDFS combination, by contrast, keeps track of state, so each subsequent run returns only new or modified files.
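
To make the distinction concrete, here is a minimal Python sketch of the two listing behaviours. It is not NiFi code and does not reflect how the processors are actually implemented; `FileEntry`, `stateless_listing`, and `StatefulLister` are hypothetical names, and the modification times are made-up integers. It only illustrates the idea that a lister which remembers the newest timestamp it has seen can skip files it already returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one file in an HDFS directory listing."""
    path: str
    modified: int  # modification time (made-up integer timestamps)


def stateless_listing(directory):
    """Return every file on every run, like a lister that keeps no state."""
    return list(directory)


class StatefulLister:
    """Remember the newest modification time seen and return only newer files.

    Simplification: files sharing the last seen timestamp are skipped here.
    """

    def __init__(self):
        self.last_seen = -1

    def list_new(self, directory):
        fresh = [f for f in directory if f.modified > self.last_seen]
        if fresh:
            self.last_seen = max(f.modified for f in fresh)
        return fresh


directory = [
    FileEntry("hdfs://app/data/customer/file1.parquet", 100),
    FileEntry("hdfs://app/data/customer/file2.parquet", 200),
]
lister = StatefulLister()

first_run = lister.list_new(directory)    # both files are new
second_run = lister.list_new(directory)   # nothing changed, so nothing returned
directory.append(FileEntry("hdfs://app/data/customer/file3.parquet", 300))
third_run = lister.list_new(directory)    # only the newly added file

repeat_run = stateless_listing(directory)  # all files again, i.e. duplicates
```

In a real flow the stateful part is handled by ListHDFS itself, and FetchHDFS then reads each listed file, so none of this bookkeeping has to be written by hand.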