Search code examples
.netvb.netopenxmlopenxml-sdk

OpenXML gets slower per row with larger files?


I'm building an app in VB.NET to read rows from an Excel file and populate them into a DataTable.

dtRow = dataTable.NewRow()
Dim startTime As DateTime = DateTime.Now

dtRow("name") = suppliers.CellValue("A", rowCount)
/* SNIP - just more string retrieval */
dtRow("statistics") = suppliers.CellValue("P", rowCount)

dataTable.Rows.Add(dtRow)

Dim endTime As DateTime = DateTime.Now

Debug.Print(String.Format("Time elapsed to retrieve '{0}': {1} ms", rowCount, (endTime - startTime).ToString("fffffff")))

CellValue is my own creation- but it is a small function and I've already measured its elapsed time. It's quite fast.

However, when I open a 10,000 row Excel file (filled with identical data), the time to process gets much slower.

3,000 rows:

Time elapsed to retrieve '2': 0510051 ms
Time elapsed to retrieve '3': 0500050 ms
Time elapsed to retrieve '4': 0340034 ms
Time elapsed to retrieve '5': 0350035 ms
Time elapsed to retrieve '6': 0340034 ms
Time elapsed to retrieve '7': 0340034 ms
Time elapsed to retrieve '8': 0350035 ms

6,000 rows:

Time elapsed to retrieve '2': 0710071 ms
Time elapsed to retrieve '3': 0760076 ms
Time elapsed to retrieve '4': 0620062 ms
Time elapsed to retrieve '5': 0670067 ms
Time elapsed to retrieve '6': 0750075 ms
Time elapsed to retrieve '7': 0750075 ms
Time elapsed to retrieve '8': 0700070 ms

10,000 rows:

Time elapsed to retrieve '2': 0920092 ms
Time elapsed to retrieve '3': 0920092 ms
Time elapsed to retrieve '4': 1790179 ms
Time elapsed to retrieve '5': 1810181 ms
Time elapsed to retrieve '6': 1930193 ms
Time elapsed to retrieve '7': 2240224 ms
Time elapsed to retrieve '8': 1820182 ms

enter image description here

Why does this occur? Can I fix it?

EDIT: suppliers is a class I have created to handle the Excel file, with this constructor:

Public Sub New(ByVal doc As SpreadsheetDocument, ByVal sheetName As String)
    pWorkbookPart = doc.WorkbookPart

    Dim sheet As Sheet = pWorkbookPart.Workbook.Descendants(Of Sheet).Where(Function(s) s.Name = sheetName).FirstOrDefault()

    pWorksheetPart = CType(pWorkbookPart.GetPartById(sheet.Id), WorksheetPart)

    pSharedStringTable = pWorkbookPart.GetPartsOfType(Of SharedStringTablePart).FirstOrDefault()
End Sub

CellValue:

Public Function CellValue(ByVal column As String, ByVal row As Integer) As String
    Dim cellAddress As String = column & row
    Dim cell As Cell = pWorksheetPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault()

    Dim index As Integer
    Dim returnValue As String

    If cell IsNot Nothing Then
        If cell.DataType IsNot Nothing Then
            index = Integer.Parse(cell.InnerText)
            returnValue = pSharedStringTable.SharedStringTable.ElementAt(index).InnerText
        Else
            returnValue = CStr(cell.InnerText)
        End If
    End If

    Return returnValue
End Function

Solution

  • This line looks suspcious:

    Dim cell As Cell = pWorksheetPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault()
    

    If the .Where() condition executes against every cell in the spreadsheet. As the number of rows grows, the number of cell address comparisons grows by (rows x columns). Even if the cell reference comparison operation is trivially simple, it adds up fast.

    If the OpenXML or the Workbook class shown do not provide convenient x,y cell addressing, you may have to create your own indexes. Make one pass over all the cells to add them to your own column lists, then you can index by x,y with abandon. x = index of the column list from the list of columns, y = index into the column list to fetch the cell.