Search code examples
c#c++excelxmllibreoffice

Excel performance on Pivot-table aggregations


In my usage of Excel, I'm always surprised at how well Excel does the following two aggregation operations:

  1. Date/Time aggregations.
  2. Case-insensitive aggregations.

How is Excel able to achieve that performance? Do they store additional data structures for pivot-related information and aggregations? Is this documented anywhere or where could I find out more about that? I've looked at the Libreoffice source code, but the actual product isn't even close to Excel in aggregation/pivot performance.


It would be great if someone who understood Excel could share a bit more about the low-level aggregation behavior or structs that Excel uses to achieve this performance -- for example, are they storing any labels twice -- once in its native case and once lower-ed for aggregation purposes? While I know this question is overly broad and not about a code answer per se, and its more conceptual, I'm hoping that an answer may serve as a good reference for methods to optimize performance on excel-style aggregations.


Here are a few things that I've noticed based on some suggestions from ARGeo --

(1) There are two files related to Pivot Cache -- Definitions (field-level info):

enter image description here

(2) And Records (row/cell level info) --

enter image description here

A few questions from this then:

  • How does Excel determine when to store the value as-is vs when to store it as a Shared record. For example, why is value in B2, "LifeLock" (a mixed-case string) stored as-is, but the value in F2, "AZ" is stored as in sharedItems (v="0")?
  • Is there any information on the internal C/C++ Struct that Excel uses in-memory for its pivotCache (rather than as various XML docs that are stored)?
  • Is there any information on how the "helper information" stored at the field-level is used internally by Excel? For example, this information:

.

<cacheField name="numEmps" numFmtId="0"><sharedItems containsString="0" containsBlank="1" containsNumber="1" containsInteger="1" minValue="0" maxValue="20000"/></cacheField>


Solution

  • Pivot-table performance is based on Pivot Cache. Although there is very little information on this subject (I mean the lack of official documentation), I've found some interesting posts and MS documentation.

    Definition:

    Pivot Cache is a special memory area where pivot table records are saved.

    enter image description here

    When you create a Pivot Table, Excel takes a copy of the source data and stores it in the Pivot Cache. The Pivot Cache is held in Excel’s memory. You can’t see it but that’s the data the Pivot Table references when you build your Pivot Table.

    This enables Excel to be very responsive to changes in the Pivot Table but it can also double the size of your file. After all, the Pivot Cache is just a duplicate of your source data so it makes sense that your file size will potentially double.

    Please, use this link and this link for further information as a starting reference point.

    Also, you can read Pivot Cache in Excel 101 and Excel Pivot Cache 101 posts to find out what it is and what side effects it has.

    Here are some VB code snippets and examples how to use PivotCache object.

    Here's a code written in C# that allows you to create an Excel workbook with some Pivot Tables which, of course, use Pivot Cache:

    System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Reflection;
    using Excel = Microsoft.Office.Interop.Excel;
    using System.IO;
    using System.Diagnostics;
    using System.Configuration;
    using System.Data.SqlClient;
    using System.Data;
     
    namespace ConsoleApplication1 {
    
        class Program {
     
            static void Main(string[] args) {
     
                Excel.Application objApp;
                Excel.Workbook objBook;
                Excel.Sheets objSheets;
                Excel.Workbooks objBooks;
     
                string command = (@"SELECT * FROM dbo.Client");
     
                using (SqlConnection connection = new SqlConnection(GetConnectionStringByName("CubsPlus"))) {
    
                    DataTable data = new DataTable();
    
                    try {
                        connection.Open();
                    }
                    catch (Exception e) {
                        StackTrace st = new StackTrace(new StackFrame(true));
                        StackFrame sf = st.GetFrame(0);
                        Console.WriteLine (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
                    }
                    try {
                        data = DataTools.SQLQueries.getDataTableFromQuery(connection, command);
     
                        if (data == null) {
                            throw new ArgumentNullException();
                        }
                    }
                    catch (Exception e) {
    
                        StackTrace st = new StackTrace(new StackFrame(true));
                        StackFrame sf = st.GetFrame(0);
                        Console.WriteLine (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
                    }
     
                    objApp = new Excel.Application();
    
                    try {     
                        objBooks = objApp.Workbooks;
                        objBook = objApp.Workbooks.Add(Missing.Value);
                        objSheets = objBook.Worksheets;
     
                        Excel.Worksheet sheet1 = (Excel.Worksheet)objSheets[1];
                        sheet1.Name = "ACCOUNTS";
                        string message = DataTools.Excel.copyDataTableToExcelSheet(data, sheet1);
    
                        if (message != null) {
                            Console.WriteLine("Problem importing the data to Excel");
                            Console.WriteLine(message);
                            Console.ReadLine();
                        }
                             
                        //CREATE A PIVOT CACHE BASED ON THE EXPORTED DATA
                        Excel.PivotCache pivotCache = objBook.PivotCaches().Add(Excel.XlPivotTableSourceType.xlDatabase,sheet1.UsedRange);
     
                        Console.WriteLine(pivotCache.SourceData.ToString());
                        
                        Console.ReadLine();
     
                        //WORKSHEET FOR NEW PIVOT TABLE
                        Excel.Worksheet sheet2 = (Excel.Worksheet)objSheets[2];
                        sheet2.Name = "PIVOT1";
                        
                        //PIVOT TABLE BASED ON THE PIVOT CACHE OF EXPORTED DATA
                        Excel.PivotTables pivotTables = (Excel.PivotTables)sheet2.PivotTables(Missing.Value);
                        Excel.PivotTable pivotTable = pivotTables.Add(pivotCache, objApp.ActiveCell, "PivotTable1", Missing.Value, Missing.Value);
     
                        pivotTable.SmallGrid = false;
                        pivotTable.TableStyle = "PivotStyleLight1";
     
                        //ADDING PAGE FIELD
                        Excel.PivotField pageField = (Excel.PivotField)pivotTable.PivotFields("ParentName");
                        pageField.Orientation = Excel.XlPivotFieldOrientation.xlPageField;
     
                        //ADDING ROW FIELD
                        Excel.PivotField rowField = (Excel.PivotField)pivotTable.PivotFields("State");
                        rowField.Orientation = Excel.XlPivotFieldOrientation.xlRowField;
     
                        //ADDING DATA FIELD
                        pivotTable.AddDataField(pivotTable.PivotFields("SetupDate"), "average setup date", Excel.XlConsolidationFunction.xlAverage);
     
                        ExcelSaveAs(objApp, objBook, @"J:\WBK");
     
                        objApp.Quit();
                    }     
                    catch (Exception e) {
    
                        objApp.Quit();
                        Console.WriteLine(e.Message);
                        Console.ReadLine();
                    }
                }
            }
     
            static string ExcelSaveAs(Excel.Application objApp, Excel.Workbook objBook, string path) {
                try {
                    objApp.DisplayAlerts = false;
                    objBook.SaveAs(path, Excel.XlFileFormat.xlExcel7, Missing.Value, Missing.Value, Missing.Value, Missing.Value, Excel.XlSaveAsAccessMode.xlNoChange, Missing.Value, Missing.Value, Missing.Value, Missing.Value, Missing.Value);
                    objApp.DisplayAlerts = true;
                    return null;
                }
                catch (Exception e) {
                    StackTrace st = new StackTrace(new StackFrame(true));
                    StackFrame sf = st.GetFrame(0);
                    return (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
                }
            }
            static string GetConnectionStringByName(string name) {
                //ASSUME FAILURE
                string returnValue = null;
     
                //Look for the name in the connectionStrings section
                ConnectionStringSettings settings = ConfigurationManager.ConnectionStrings[name];
     
                // If found, return the connection string
                if (settings != null) {
                    returnValue = settings.ConnectionString;
                }
                return returnValue;
            }
        }
    }
    

    And here's a code written in VB that allows us to create a new Pivot Cache for selected Pivot Table:

    Sub SelPTNewCache()
    
        Dim wsTemp As Worksheet
        Dim pt As PivotTable
        
        On Error Resume Next
        Set pt = ActiveCell.PivotTable
        
        If pt Is Nothing Then
            MsgBox "Active cell is not in a pivot table"
        Else
            Set wsTemp = Worksheets.Add
            
            ActiveWorkbook.PivotCaches.Create( _
                SourceType:=xlDatabase, _
                SourceData:=pt.SourceData).CreatePivotTable _
                TableDestination:=wsTemp.Range("A3"), _
                TableName:="PivotTableTemp"
            
            pt.CacheIndex = wsTemp.PivotTables(1).CacheIndex
            
            Application.DisplayAlerts = False
            wsTemp.Delete
            Application.DisplayAlerts = True
        End If
        
    exitHandler:
            Set pt = Nothing
    
    End Sub
    

    enter image description here

    1. In your asd.js file there are the following elements:

    s stands for a string value

    n stands for a numeric value

    d stands for a date value

    x stands for an index value

    v indicates a value itself

    So, let's translate in human language a data contained in F2 cell of this table:

    enter image description here

    <x v="0"/>
    

    A value of 0 is an zero index in array of Strings where abbreviations of USA states are stored. The first index in this array retrieves Arizona for us. I don't know why the cell in next row contains lowercased az and all the others contain uppercased AZ but I'm sure it's not about Shared Record.

    2. I haven't found any useful information on the internal C/C++ Struct that Excel uses in-memory for its pivotCache.

    And, at last:

    3. Here's a LINK containing useful info regarding "helper information" in third extra question.

    P.S.

    About Big O notation.

    Big O notation is used in Computer Science to describe the performance or complexity of an algorithm. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (in memory or on disk) by an algorithm. Big O notation is a measure of the complexity of your program in terms of the size of the input.

    • O(1) stands for algorithm that always executes in the same time regardless of the size of the input data set.

    • O(N) stands for algorithm whose performance grows linearly and in direct proportion to the size of the input data set.

    • O(N*N) stands for algorithm whose performance is directly proportional to the size's square of the input data set.

    • T(N) = O(log N) stands for algorithm whose performance depends on logarithmic time. Algorithms taking logarithmic time are commonly found in operations on binary trees or when using binary search.

    But good sorting algorithms are harshly O(N log N). An algorithm's example with this efficiency can be merge sort, which breaks up an array into two halves, sorts those two halves by recursively calling itself on them, and then merging the result back into a single array.

    Here's an abstract C# code's snippet showing how O(N log N) algorithm works (approximately the same approach can be used for creating a pivot table):

    public static int[] MergeSort(int[] inputItems, int lowerBound, int upperBound) {
        if (lowerBound < upperBound) {
            int middle = (lowerBound + upperBound) / 2;
            MergeSort(inputItems, lowerBound, middle);
            MergeSort(inputItems, middle + 1, upperBound);
     
            int[] leftArray = new int[middle - lowerBound + 1];
            int[] rightArray = new int[upperBound - middle];
     
            Array.Copy(inputItems, lowerBound, leftArray, 0, middle - lowerBound + 1);
            Array.Copy(inputItems, middle + 1, rightArray, 0, upperBound - middle);
     
            int i = 0;
            int j = 0;
            for (int count = lowerBound; count < upperBound + 1; count++) {
                if (i == leftArray.Length) {
                    inputItems[count] = rightArray[j];
                    j++;
                }
                else if (j == rightArray.Length) {
                    inputItems[count] = leftArray[i];
                    i++;
                }
                else if (leftArray[i] <= rightArray[j]) {
                    inputItems[count] = leftArray[i];
                    i++;
                }
                else {
                    inputItems[count] = rightArray[j];
                    j++;
                }
            }
        }
        return inputItems;
    }