Tags: python-3.x, python-docx

Extremely slow when adding a table to python-docx from a CSV file


I have to add a table from a CSV file of around 1500 rows and 9 columns (about 75 pages) to a .docx Word document using python-docx.

I have tried different approaches, reading the CSV with pandas or opening the CSV file directly. It takes around 150 minutes to finish the job regardless of which approach I choose.

My question is whether this is normal behavior, or whether there is a way to speed up this task.

I'm using this for loop to read several CSV files and write each one out as a table:

        for toTAB in listBRUTO:
            df= pd.read_csv(toTAB)
            
            # add a table to the end and create a reference variable
            # extra row is so we can add the header row
            t = doc.add_table(df.shape[0]+1, df.shape[1])
            t.style = 'LightShading-Accent1' # border
           
        
            # add the header rows.
            for j in range(df.shape[-1]):
                t.cell(0,j).text = df.columns[j]
                
            # add the rest of the data frame
            for i in range(df.shape[0]):
                for j in range(df.shape[-1]):
                    t.cell(i+1,j).text = str(df.values[i,j])
            
            #TABLE Format
            for row in t.rows:
                for cell in row.cells:
                    paragraphs = cell.paragraphs
                    for paragraph in paragraphs:
                        for run in paragraph.runs:
                            font = run.font
                            font.name = 'Calibri'
                            font.size= Pt(7)

            
            doc.add_page_break()
        doc.save('blabla.docx')

Thanks in advance


Solution

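  • You'll want to minimize the number of calls to table.cell(). Because of the way cell-merging works, these are expensive operations that really add up when performed in a tight loop. With roughly 1500 rows and 9 columns, the loop in the question makes about 13,500 such calls per table.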

    I would start with refactoring this block and see how much improvement that yields:

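    # --- add the rest of the data frame ---
    # `t` is the table created by doc.add_table() in the question
    for i in range(df.shape[0]):
        # fetch each row's cells once instead of calling t.cell() per cell
        for j, cell in enumerate(t.rows[i + 1].cells):
            cell.text = str(df.values[i, j])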
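
As a fuller sketch of the same idea, the snippet below fills the whole table, header included, while fetching each row's cells only once. The names fill_table and csv_to_docx are hypothetical helpers, not part of python-docx, and the example reads the CSV with the standard csv module rather than pandas so that it stays self-contained. It assumes python-docx is installed and that the table was created with one row per CSV line. How much time it saves will depend on the python-docx version and the table size.

```python
import csv


def fill_table(table, header, records):
    """Write a header row and data rows into an existing python-docx table.

    Hypothetical helper: each row's cells are fetched once through
    row.cells, so table.cell(i, j) is never called inside the loop.
    """
    all_rows = [header, *records]
    # zip() stops at the shorter input, so extra table rows or extra
    # CSV lines are silently ignored rather than raising an error.
    for table_row, values in zip(table.rows, all_rows):
        for cell, value in zip(table_row.cells, values):
            cell.text = str(value)


def csv_to_docx(csv_path, docx_path):
    """Hypothetical helper: save one CSV file as a one-table document."""
    # Imported here so fill_table itself does not depend on python-docx.
    from docx import Document

    with open(csv_path, newline="") as f:
        # First line is treated as the header, the rest as data rows.
        header, *records = csv.reader(f)

    doc = Document()
    # One extra row for the header, as in the question's code.
    table = doc.add_table(rows=len(records) + 1, cols=len(header))
    fill_table(table, header, records)
    doc.save(docx_path)
```

With the pandas code from the question, the equivalent call would be fill_table(t, df.columns, df.values.tolist()), which also avoids re-reading df.values for every cell.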