Apache NiFi: How to create a Parquet file from a CSV file with the schema saved in the "avro.schema" attribute


I am trying to create a Parquet file from a CSV file using Apache NiFi.

I can convert the CSV to a Parquet file, but the schema of the resulting Parquet file contains struct types. I need those columns to be string type instead.

I am using Apache NiFi 1.14.0 on Windows Server 2016.

This is what I have tried so far to convert CSV to Parquet.

I have used the following three controller services:

  1. CSVReader
  2. CSVRecordSetWriter
  3. ParquetRecordSetWriter

And these are the processors in the flow:

  1. GetFile
  2. ConvertRecord (CSVReader to CSVRecordSetWriter; this automatically generates the "avro.schema" attribute, which I update in the next step)
  3. UpdateAttribute (updates the "avro.schema" attribute: wherever two data types were inferred for a field, I replace them with '["null","string"]')
  4. ConvertRecord (CSVReader to ParquetRecordSetWriter)
  5. UpdateAttribute (appends '.parquet' to the filename)
  6. PutFile
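
The schema rewrite in step 3 can be sketched outside NiFi to make the intended transformation explicit. This is only an illustration of the logic in plain Python, not the NiFi Expression Language used in UpdateAttribute. The function name `flatten_unions_to_string` and the sample schema (record name, field names) are made up for the example, and it assumes the inferred schema is a flat Avro record whose mixed-type columns appear as unions with more than one non-null branch.

```python
import json


def flatten_unions_to_string(avro_schema_text):
    """Replace every union with more than one non-null branch by ["null", "string"].

    Hypothetical helper: assumes a flat Avro record schema, as a CSV
    with simple columns would produce.
    """
    schema = json.loads(avro_schema_text)
    for field in schema.get("fields", []):
        field_type = field["type"]
        # In Avro JSON, a union is written as a list of branch types.
        if isinstance(field_type, list):
            non_null = [t for t in field_type if t != "null"]
            # Two or more non-null branches are what end up as a struct
            # in the Parquet output, so collapse them to a nullable string.
            if len(non_null) > 1:
                field["type"] = ["null", "string"]
    return json.dumps(schema)


# Illustrative schema text, similar in shape to an inferred "avro.schema" value.
example = json.dumps({
    "type": "record",
    "name": "exampleRecord",
    "fields": [
        {"name": "id", "type": ["null", "int"]},
        {"name": "code", "type": ["null", "int", "string"]},
    ],
})

print(flatten_unions_to_string(example))
```

Fields that already have a single non-null type, such as `id` above, are left unchanged; only the mixed-type `code` field is rewritten.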

I would also like to know how to view a .parquet file on Windows. Currently, I read the Parquet file with PySpark and check the schema there.

This is what the Parquet file schema looks like after conversion. I want string instead of struct in the output.

[Screenshot: Parquet file schema after conversion]

Please note: there are many CSV files with many columns/fields, so I don't want to create the schemas manually.

Any other way to achieve this would also be very helpful.
Thanks!


Solution

  • After experimenting with more of the "ParquetRecordSetWriter" options, I was able to create a Parquet file with the schema that I had captured in the "avro.schema" attribute.

    [Screenshot: ParquetRecordSetWriter configuration]