Search code examples
.netvb.netopenxml

Reading a large Excel file by OpenXML


I am working on a project in which i am reading few cells from an excel file using OpenXML SDK in VB.Net and storing it in DataTable. It works just fine for the medium sized and small files however when when i try to open a large file of size 107MB, i get an OutOfMemory exception after reading through few sheets. I am able to open the file by double clicking(it takes a while though) Below are the code that i am using. Please let me know if i can improve the process of reading by reducing the memory consumption

Dim CellRage As String() = {"AG65", "AG281", "AG335", "AG389", "AG443", "AG497", "AG551", "AG800", "AG913", "AG1081", "AG1165", "AG1305"}
Dim CellValue(13) As String

Using myWorkbook As SpreadsheetDocument = SpreadsheetDocument.Open(stream, False)
     workbookPart = myWorkbook.WorkbookPart

     For Each worksheetpart As WorksheetPart In workbookPart.WorksheetParts
       For count As Integer = 0 To CellRage.GetUpperBound(0) -1
             CellValue(count) = CellValue(workbookPart, sheetName, CostCellRage(count - 2)) 
       Next
       'After few sheets throws OutofMemory Exception
     Next
End Using

Private Shared Function GetCellValue(workbookPart As WorkbookPart, sheetName As String, cellAddress As String) As String
        Dim theCell As Cell
        Dim wsPart As WorksheetPart
        Dim worksheet As Sheet
        Dim value As String
        Dim stringTablePart As SharedStringTablePart = workbookPart.SharedStringTablePart

        worksheet = workbookPart.Workbook.Descendants(Of Sheet).Where(Function(s) s.Name = sheetName).FirstOrDefault
        wsPart = CType(workbookPart.GetPartById(worksheet.Id), WorksheetPart)
        theCell = wsPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault

        If theCell.ChildElements.Count = 0 Then
            Return ""
        End If

        value = theCell.CellValue.Text

        If (theCell.DataType IsNot Nothing) AndAlso (theCell.DataType.ToString() = "s") Then
            value = stringTablePart.SharedStringTable.ChildElements(Int32.Parse(value)).InnerText
        End If

        Return value

    End Function

Thanks for looking into this


Solution

  • The problem you have is that your code is reading each sheet into memory which is eventually leading to you using too much memory.

    As Jesper points out you can use a SAX approach rather than the DOM approach you are currently using. To read an Excel document using OpenXML with a SAX approach you can use the OpenXmlReader class. This will read the file in a more memory efficient way by handling sections of the XML at a time rather than the whole DOM which should allow you to process large files.

    The OpenXmlReader reads the XML contents of the file part in XML chunks; think of it like reading a file using a Stream. We can't jump to an arbitrary cell as we haven't read the whole document yet. Instead what we need to do is read each row and ignore the rows we don't want. Once we have the row we are interested in we have the entire row XML available so at this point we can jump directly to the cells within that row that we are interested in.

    Dim desiredColumnNumber As Integer = 33
    Dim cellRange As Integer() = New Integer() {65, 281, 335, 389, 443, 497, _
    551, 800, 913, 1081, 1165, 1305}
    
    Using reader As OpenXmlReader = OpenXmlReader.Create(worksheetPart)
        While reader.Read()
            'we want to find the first row
            If reader.ElementType = GetType(Row) Then
                Do
                    If Not cellRange.Contains(Convert.ToInt32(CUInt(row.RowIndex))) Then
                        'we're not interested in this row so continue
                        Continue Do
                    End If
    
                    Dim row As Row = DirectCast(reader.LoadCurrentElement(), Row)
    
                    If row.HasChildren Then
                        'get cell in column AG
                        Dim cell As Cell = DirectCast(row.ChildElements(desiredColumnNumber), Cell)
                        'do something with the cell...
                    End If
                    'move to the next row
                Loop While reader.ReadNextSibling()
            End If
        End While
    End Using
    

    In the code above I have split your cell ranges into a cell number (desiredCellNumber) and an array of Integer (cellRange) that stores the number of each row we are interested in (if you can't split this in your original code you'll have to convert each cell reference into the correct format).

    After that we then check the RowIndex property on each row to see if it exists in the cellRange. If not then we move on to the next row but if it is a row we are interested we can access the cells by accessing the Row.ChildElements property. In the code above we are only interested in column 33 so we read that cell value using row.ChildElements(desiredColumnNumber).

    As an aside, when I'm dealing with large files like this I usually read the SharedStringTablePart into a Dictionary or similar first and then read from the there when I need to get a string value. Obviously the amount of memory this consumes depends heavily on the contents of your Excel file so you might want to do something cleverer to minimise the memory used when storing the Shared Strings. The code for reading the shared strings is (almost) identical to the above:

    Dim sharedStrings As New Dictionary(Of Integer, String)()
    If stringTablePart IsNot Nothing Then
        Using reader As OpenXmlReader = OpenXmlReader.Create(stringTablePart)
            Dim i As Integer = 0
            While reader.Read()
                If reader.ElementType = GetType(SharedStringItem) Then
                    Dim sharedStringItem As SharedStringItem = DirectCast(reader.LoadCurrentElement(), SharedStringItem)
                    sharedStrings.Add(i), If(sharedStringItem.Text IsNot Nothing, sharedStringItem.Text.Text, String.Empty))
                    i = i + 1
                End If
            End While
        End Using
    End If