
Divide a Parquet file into subfiles using fastparquet


I need to convert a CSV file to Parquet format. But this CSV file is very large (more than 65,000 rows and 1,000 columns), so I need to split the output into several Parquet subfiles of 5,000 rows and 200 columns each. I have already tried partition_on and row_group_offsets, but neither does what I want.

My code:

import pandas as pd
import fastparquet as fp

# Raw strings (r'...') are needed: in a plain string, '\U' in 'D:\Users'
# is parsed as a unicode escape and raises a SyntaxError in Python 3.
df = pd.read_csv(r'D:\Users\mim\Desktop\SI\LOG\LOG.csv')
fp.write(r'D:\Users\mim\Desktop\SI\newdata.parq', df)
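For context, the key to splitting by rows is the chunksize argument of pandas.read_csv, which the solution below relies on: it turns the reader into an iterator of DataFrames of at most that many rows. A minimal sketch with a tiny in-memory CSV (hypothetical data, semicolon-delimited like the real LOG.csv):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real LOG.csv (hypothetical data).
csv_data = io.StringIO(
    "timestamp;a;b;c\n"
    "1;10;20;30\n"
    "2;11;21;31\n"
    "3;12;22;32\n"
    "4;13;23;33\n"
    "5;14;24;34\n"
)

# chunksize makes read_csv return an iterator of DataFrames,
# each holding at most that many rows.
chunks = list(pd.read_csv(csv_data, delimiter=";", chunksize=2))

print([len(c) for c in chunks])  # row counts per chunk: [2, 2, 1]
```

With 5 data rows and chunksize=2, the last chunk simply holds the 1 remaining row; the same happens with 65,000 rows and chunksize=5000.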

Solution


    import os
    import pandas as pd
    import fastparquet as fp
    
    # Raw string avoids '\U...' being parsed as a unicode escape
    pathglobalcsv = r'D:\Users\mim\Desktop\SI'
    inputFile = os.path.join(pathglobalcsv, 'LOG.csv')
    
    # chunksize=5000 yields DataFrames of at most 5,000 rows each
    table = pd.read_csv(inputFile, delimiter=';', chunksize=5000)
    listrow = list(table)
    # number of columns (len(df) alone would count rows, not columns)
    columnCount = len(listrow[0].columns)
    
    fileCounter = 0
    
    for row in listrow:
        # step through the data columns in blocks of 199,
        # so that timestamp + block = 200 columns per file
        for col in range(1, columnCount, 199):
            # .ix is removed from modern pandas; use .iloc for positional indexing
            timestampcolumn = row.iloc[:, [0]]          # timestamp column
            maincolumns = row.iloc[:, col:col + 199]    # next block of 199 data columns
            outputDF = pd.concat([timestampcolumn, maincolumns], axis=1)
    
            #create a new file
            fileCounter += 1
    
            #parquet file
            fp.write(r'C:\Users\mim\eclipse-workspace\SI\file.part_' + str(fileCounter) + '.par', outputDF)
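The column loop above pairs the first (timestamp) column with successive blocks of data columns, so every output file carries the timestamp alongside its slice. The same slicing can be sketched on toy data with a small block size (all names and sizes here are illustrative, not from the original):

```python
import pandas as pd

# Toy frame: one timestamp column plus 6 data columns (hypothetical data).
df = pd.DataFrame({"timestamp": [1, 2, 3],
                   **{f"c{i}": [i, i, i] for i in range(1, 7)}})

block = 3  # stand-in for the 199 data columns per file in the answer
parts = []
for col in range(1, len(df.columns), block):
    ts = df.iloc[:, [0]]                  # keep the timestamp column
    data = df.iloc[:, col:col + block]    # next block of data columns
    parts.append(pd.concat([ts, data], axis=1))

print([p.shape for p in parts])  # each part: 3 rows, 1 + block columns
```

Each part gets 1 + block columns, matching the 1 + 199 = 200 columns per file in the solution; if the column count is not a multiple of the block size, the final part is simply narrower.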