Tags: python, excel, xlsx, parquet, apache-nifi

Convert xlsx to parquet


Is it possible to convert an xlsx Excel file to Parquet without first converting it to CSV? I have many Excel files, each with many sheets, and I don't want to convert every sheet to CSV and then to Parquet, so I wonder whether there is a way to convert Excel directly to Parquet. Or is there a way to do it with NiFi? So far I was planning to do it this way, using a Python script:

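import csv
import xlrd

def csv_from_excel():
    # Note: xlrd 2.x reads only legacy .xls workbooks, not .xlsx
    wb = xlrd.open_workbook('your_workbook.xls')
    for name in wb.sheet_names():
        sh = wb.sheet_by_name(name)
        # Write one CSV per sheet so later sheets do not overwrite earlier ones
        csv_name = 'your_csv_file_' + name + '.csv'
        # Python 3: open in text mode with newline='' for the csv module
        with open(csv_name, 'w', newline='') as your_csv_file:
            wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
            for rownum in range(sh.nrows):
                wr.writerow(sh.row_values(rownum))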

Solution

  • From a NiFi perspective, the two interesting questions here are:

    1. Can NiFi pick up this Excel file?

    This should not be too difficult when leveraging the XLSX processor, but if your situation is a bit more complex, this elaborate HCC article might be helpful.

    2. Can NiFi write to Parquet?

    This part is easy: with the PutParquet processor, NiFi can write directly to Parquet.
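
For the Python route mentioned in the question, the CSV step can probably be skipped as well. The following is only a minimal sketch, assuming pandas is installed together with openpyxl (to read .xlsx) and pyarrow (to write Parquet). The helper names `parquet_path_for` and `excel_to_parquet`, the `excel_files` and `parquet_files` folders, and the `<workbook>__<sheet>.parquet` naming scheme are all hypothetical choices made for illustration.

```python
import re
from pathlib import Path


def parquet_path_for(xlsx_path, sheet_name, out_dir):
    """Build an output path like out/book__Sheet_1.parquet (hypothetical naming scheme)."""
    # Replace anything that is not a word character, dot, or hyphen with "_".
    safe_sheet = re.sub(r"[^\w.-]+", "_", sheet_name).strip("_")
    return Path(out_dir) / f"{Path(xlsx_path).stem}__{safe_sheet}.parquet"


def excel_to_parquet(xlsx_path, out_dir):
    """Write every sheet of one workbook to its own Parquet file, with no CSV step."""
    # Imported here so the path helper above stays usable without pandas installed.
    import pandas as pd

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # sheet_name=None makes read_excel return {sheet name: DataFrame} for all sheets.
    sheets = pd.read_excel(xlsx_path, sheet_name=None)

    written = []
    for sheet_name, df in sheets.items():
        # Parquet needs string column names; Excel headers may be numbers or dates.
        df.columns = [str(c) for c in df.columns]
        target = parquet_path_for(xlsx_path, sheet_name, out_dir)
        df.to_parquet(target, index=False)
        written.append(target)
    return written


if __name__ == "__main__":
    # Convert every workbook found in the (hypothetical) input folder.
    for workbook in sorted(Path("excel_files").glob("*.xlsx")):
        for path in excel_to_parquet(workbook, "parquet_files"):
            print(path)
```

Each sheet becomes its own Parquet file because sheets in one workbook often have different columns. Sheets whose columns mix types (for example numbers and text in the same column) may need cleaning before `to_parquet` accepts them.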