Search code examples
.netperformanceexceloffice-interop

How to make this custom worksheet initialization faster?


Summary

This question is somehow the follow-up to this question:
How to implement column self-naming from its index?

Having tested the code provided in this above-linked question's answers, I finally encountered a serious performance issue.

Performance issue

The performance issue occurs upon a Sheet initialization, that is, when I initialize the Sheet's Cells.

    ''' <summary>
    ''' Initialize an instance of the Company.Project.Sheet class.
    ''' </summary>
    ''' <param name="nativeSheet">The native worksheet from which to initialize.</param>
    Friend Sub New(ByVal nativeSheet As Microsoft.Office.Interop.Excel.Worksheet)
        _nativeSheet = nativeSheet
        Dim cells As IDictionary(Of String, ICell) = New Dictionary(Of String, ICell)()

        'These iterations hurt the performance of the API...'
        For rowIndex As Integer = 1 To _nativeSheet.Rows.Count Step 1
            For colIndex As Integer = 1 To _nativeSheet.Columns.Count Step 1
                Dim c As ICell = New Cell(_nativeSheet.Cells(rowIndex, colIndex))
                cellules.Add(c.Name, c)
            Next
        Next

        _cellules = New ReadOnlyDictionary(Of String, ICell)(cells)
    End Sub
  • ReadOnlyDictionary(Of TKey, TValue) :
    A custom read-only dictionary that simply wraps a IDictionary(Of TKey, TValue) to prevent modifications.

Discussion

I'm working this way since each cell in an underlying spreadsheet worksheet is initialized from the initialization of the worksheet until the end, that is, when the worksheet is disposed or finalized. Hence, the same way I wish to initialize the cells of a Sheet, but I also wish to keep the performance boost of using the indexed cells over the named ("A1") cells, while keeping the ease of use to the API user to refer to a cell with its name, that is how I intend to use the dictionary, so that when I refer to cell "A1", I access this key into my dictionary and address the cell (1, 1) accordingly.

  • Aside, I know of an even faster way to read from a worksheet using the Worksheet.UsedRange property that returns all of the used cells into a 2D matrix.

    If there was anyhow the same or about the same for the set of cells with which I could initialize multiple instances of my Cell class with, this would be great, and performant!

  • I also thought of initializing like only a 100 x 100 matrix cells in memory while mapping them with my dictionary, as one will rarely use the whole sheet's cells. As such, I am still thinking of a way where I would have to access a not yet initialized cell, let's say Cells(120, 120). Ideally, I think, the program would have to initialize all the cells between the maximum initially initialized Cell(100, 100) until Cell (120, 120). Am I clear enough here? Feel free to ask for clarification! =)

  • Another option could be that I only initialize the cells' names into the dictionary and keeping there row and column index in memory, not initializing a Cell instance with its nativeCell, say a Range. Here's the code of my Cell class to illustrate what I mean.

    ''' ''' Represents a cell in a worksheet. ''' ''' Friend Class Cell Implements ICell

    Private _nativeCell As Microsoft.Office.Interop.Excel.Range
    Private _name As String
    
    ''' <summary>
    ''' Initializes a new instance of the Company.Project.Cell class.
    ''' </summary>
    ''' <param name="nativeCell">The Microsoft.Office.Interop.Excel.Range to wrap.</param>
    Friend Sub New(ByVal nativeCell As Microsoft.Office.Interop.Excel.Range)
        _nativeCell = nativeCell
    End Sub
    
    Public ReadOnly Property NativeCell() As Microsoft.Office.Interop.Excel.Range Implements ICellule.NativeCell
        Get
            Return _nativeCell 
        End Get
    End Property
    
    Public ReadOnly Property Column() As Integer Implements ICell.Column
        Get
            Return _nativeCell.Column
        End Get
    End Property
    
    Public ReadOnly Property Row() As Integer Implements ICell.Row
        Get
            Return _nativeCell.Row
        End Get
    End Property
    
    Public ReadOnly Property Name() As String Implements ICellule.Name
        Get
            If (String.IsNullOrEmpty(_name) OrElse _name.Trim().Length = 0) Then _
                _name = GetColumnName()
    
            Return _nom
        End Get
    End Property
    
    Public Property Value() As Object Implements ICellule.Value
        Get
            Return _nativeCell.Value2
        End Get
        Set(ByVal value As Object)
            _nativeCell.Value2 = value
        End Set
    End Property
    
    Public ReadOnly Property FormattedValue() As String Implements ICellule.FormattedValue
        Get
            Return _nativeCell.Text
        End Get
    End Property
    
    Public ReadOnly Property NumericValue() As Double? Implements ICellule.NumericValue
        Get
            Return Value
        End Get
    End Property
    

Questions

  1. What are my other options?

  2. Are there any other ways to walk through?

  3. Is there a way I can make the actual approach viable as for performance concerns?

For your information, this issue timed out on testing, so the test never ended within an acceptable time range which actually take centuries...

Any thoughts are welcome! I'm open minded to other solutions or approach that will help me achieve this objective while addressing this performance issue.

Thanks to you all! =)

EDIT #1

Thanks to Maxim Gueivandov, his solution solves the issue I have addressed in this question.

Aside, there's another problem that arose from this solution: SystemOutOfMemoryException, and that will be addressed in another question.

My Sincerest Thanks to Maxim Gueivandov.


Solution

  • You could try to get all cells in the used range in one hop, thus avoiding to call Cells(rowIndex, colIndex) on each iteration of iteration (I guess that Cells hides an interop call, which may have a performance impact).

    Dim usedRange As Range = nativeSheet.UsedRange
    Dim cells(,) As Object = DirectCast(usedRange.get_Value( _
        XlRangeValueDataType.xlRangeValueDefault), Object(,))
    [... do your row/col iterations ...]
    

    You'll find some performance tips on which I based these assumptions in the following article: C# Excel Interop Use. Most notably, check the benchmark part:

    === Excel interop benchmark in C# ===

    Cells[]: 30.0 seconds

    get_Range(), Cells[]: 15.0 seconds

    UsedRange, get_Value(): 1.5 seconds [fastest]