In my usage of Excel, I'm always surprised at how well Excel does the following two aggregation operations:
How is Excel able to achieve that performance? Do they store additional data structures for pivot-related information and aggregations? Is this documented anywhere or where could I find out more about that? I've looked at the Libreoffice source code, but the actual product isn't even close to Excel in aggregation/pivot performance.
It would be great if someone who understood Excel could share a bit more about the low-level aggregation behavior or structs that Excel uses to achieve this performance -- for example, are they storing any labels twice -- once in its native case and once lower-ed for aggregation purposes? While I know this question is overly broad and not about a code answer per se, and its more conceptual, I'm hoping that an answer may serve as a good reference for methods to optimize performance on excel-style aggregations.
Here are a few things that I've noticed based on some suggestions from ARGeo --
(1) There are two files related to Pivot Cache -- Definitions (field-level info):
(2) And Records (row/cell level info) --
A few questions from this then:
Struct
that Excel uses in-memory for its pivotCache (rather than as various XML docs that are stored)?.
<cacheField name="numEmps" numFmtId="0"><sharedItems containsString="0" containsBlank="1" containsNumber="1" containsInteger="1" minValue="0" maxValue="20000"/></cacheField>
Pivot-table performance is based on Pivot Cache
. Although there is very little information on this subject (I mean the lack of official documentation), I've found some interesting posts and MS documentation.
Definition:
Pivot Cache
is a special memory area where pivot table records are saved.
When you create a
Pivot Table
, Excel takes a copy of the source data and stores it in thePivot Cache
. ThePivot Cache
is held in Excel’s memory. You can’t see it but that’s the data the Pivot Table references when you build your Pivot Table.
This enables Excel to be very responsive to changes in the Pivot Table but it can also double the size of your file
. After all, the Pivot Cache is just a duplicate of your source data so it makes sense that your file size will potentially double.
Please, use this link and this link for further information as a starting reference point.
Also, you can read Pivot Cache in Excel 101 and Excel Pivot Cache 101 posts to find out what it is and what side effects it has.
Here are some VB code snippets and examples how to use PivotCache object.
Here's a code written in C# that allows you to create an Excel workbook with some Pivot Tables
which, of course, use Pivot Cache
:
System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Reflection;
using Excel = Microsoft.Office.Interop.Excel;
using System.IO;
using System.Diagnostics;
using System.Configuration;
using System.Data.SqlClient;
using System.Data;
namespace ConsoleApplication1 {
class Program {
static void Main(string[] args) {
Excel.Application objApp;
Excel.Workbook objBook;
Excel.Sheets objSheets;
Excel.Workbooks objBooks;
string command = (@"SELECT * FROM dbo.Client");
using (SqlConnection connection = new SqlConnection(GetConnectionStringByName("CubsPlus"))) {
DataTable data = new DataTable();
try {
connection.Open();
}
catch (Exception e) {
StackTrace st = new StackTrace(new StackFrame(true));
StackFrame sf = st.GetFrame(0);
Console.WriteLine (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
}
try {
data = DataTools.SQLQueries.getDataTableFromQuery(connection, command);
if (data == null) {
throw new ArgumentNullException();
}
}
catch (Exception e) {
StackTrace st = new StackTrace(new StackFrame(true));
StackFrame sf = st.GetFrame(0);
Console.WriteLine (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
}
objApp = new Excel.Application();
try {
objBooks = objApp.Workbooks;
objBook = objApp.Workbooks.Add(Missing.Value);
objSheets = objBook.Worksheets;
Excel.Worksheet sheet1 = (Excel.Worksheet)objSheets[1];
sheet1.Name = "ACCOUNTS";
string message = DataTools.Excel.copyDataTableToExcelSheet(data, sheet1);
if (message != null) {
Console.WriteLine("Problem importing the data to Excel");
Console.WriteLine(message);
Console.ReadLine();
}
//CREATE A PIVOT CACHE BASED ON THE EXPORTED DATA
Excel.PivotCache pivotCache = objBook.PivotCaches().Add(Excel.XlPivotTableSourceType.xlDatabase,sheet1.UsedRange);
Console.WriteLine(pivotCache.SourceData.ToString());
Console.ReadLine();
//WORKSHEET FOR NEW PIVOT TABLE
Excel.Worksheet sheet2 = (Excel.Worksheet)objSheets[2];
sheet2.Name = "PIVOT1";
//PIVOT TABLE BASED ON THE PIVOT CACHE OF EXPORTED DATA
Excel.PivotTables pivotTables = (Excel.PivotTables)sheet2.PivotTables(Missing.Value);
Excel.PivotTable pivotTable = pivotTables.Add(pivotCache, objApp.ActiveCell, "PivotTable1", Missing.Value, Missing.Value);
pivotTable.SmallGrid = false;
pivotTable.TableStyle = "PivotStyleLight1";
//ADDING PAGE FIELD
Excel.PivotField pageField = (Excel.PivotField)pivotTable.PivotFields("ParentName");
pageField.Orientation = Excel.XlPivotFieldOrientation.xlPageField;
//ADDING ROW FIELD
Excel.PivotField rowField = (Excel.PivotField)pivotTable.PivotFields("State");
rowField.Orientation = Excel.XlPivotFieldOrientation.xlRowField;
//ADDING DATA FIELD
pivotTable.AddDataField(pivotTable.PivotFields("SetupDate"), "average setup date", Excel.XlConsolidationFunction.xlAverage);
ExcelSaveAs(objApp, objBook, @"J:\WBK");
objApp.Quit();
}
catch (Exception e) {
objApp.Quit();
Console.WriteLine(e.Message);
Console.ReadLine();
}
}
}
static string ExcelSaveAs(Excel.Application objApp, Excel.Workbook objBook, string path) {
try {
objApp.DisplayAlerts = false;
objBook.SaveAs(path, Excel.XlFileFormat.xlExcel7, Missing.Value, Missing.Value, Missing.Value, Missing.Value, Excel.XlSaveAsAccessMode.xlNoChange, Missing.Value, Missing.Value, Missing.Value, Missing.Value, Missing.Value);
objApp.DisplayAlerts = true;
return null;
}
catch (Exception e) {
StackTrace st = new StackTrace(new StackFrame(true));
StackFrame sf = st.GetFrame(0);
return (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
}
}
static string GetConnectionStringByName(string name) {
//ASSUME FAILURE
string returnValue = null;
//Look for the name in the connectionStrings section
ConnectionStringSettings settings = ConfigurationManager.ConnectionStrings[name];
// If found, return the connection string
if (settings != null) {
returnValue = settings.ConnectionString;
}
return returnValue;
}
}
}
And here's a code written in VB that allows us to create a new Pivot Cache
for selected Pivot Table
:
Sub SelPTNewCache()
Dim wsTemp As Worksheet
Dim pt As PivotTable
On Error Resume Next
Set pt = ActiveCell.PivotTable
If pt Is Nothing Then
MsgBox "Active cell is not in a pivot table"
Else
Set wsTemp = Worksheets.Add
ActiveWorkbook.PivotCaches.Create( _
SourceType:=xlDatabase, _
SourceData:=pt.SourceData).CreatePivotTable _
TableDestination:=wsTemp.Range("A3"), _
TableName:="PivotTableTemp"
pt.CacheIndex = wsTemp.PivotTables(1).CacheIndex
Application.DisplayAlerts = False
wsTemp.Delete
Application.DisplayAlerts = True
End If
exitHandler:
Set pt = Nothing
End Sub
1. In your
asd.js
file there are the following elements:
– s
stands for a string value
– n
stands for a numeric value
– d
stands for a date value
– x
stands for an index value
– v
indicates a value itself
So, let's translate in human language a data contained in F2
cell of this table:
<x v="0"/>
A value of 0
is an zero index
in array of Strings where abbreviations of USA states are stored. The first index in this array retrieves Arizona
for us. I don't know why the cell in next row contains lowercased az
and all the others contain uppercased AZ
but I'm sure it's not about Shared Record
.
2. I haven't found any useful information on the internal C/C++ Struct that Excel uses in-memory for its pivotCache.
And, at last:
3. Here's a LINK containing useful info regarding "helper information" in third extra question.
P.S.
About Big O notation.
Big O notation is used in Computer Science to describe the performance or complexity of an algorithm. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (in memory or on disk) by an algorithm.
Big O notation
is a measure of the complexity of your program in terms of the size of the input.
O(1)
stands for algorithm that always executes in the same time regardless of the size of the input data set.
O(N)
stands for algorithm whose performance grows linearly and in direct proportion to the size of the input data set.
O(N*N)
stands for algorithm whose performance is directly proportional to the size's square of the input data set.
T(N) = O(log N)
stands for algorithm whose performance depends on logarithmic time. Algorithms taking logarithmic time are commonly found in operations on binary trees or when using binary search.
But good sorting algorithms are harshly O(N log N)
. An algorithm's example with this efficiency can be merge sort, which breaks up an array into two halves, sorts those two halves by recursively calling itself on them, and then merging the result back into a single array.
Here's an abstract C# code's snippet showing how O(N log N)
algorithm works (approximately the same approach can be used for creating a pivot table):
public static int[] MergeSort(int[] inputItems, int lowerBound, int upperBound) {
if (lowerBound < upperBound) {
int middle = (lowerBound + upperBound) / 2;
MergeSort(inputItems, lowerBound, middle);
MergeSort(inputItems, middle + 1, upperBound);
int[] leftArray = new int[middle - lowerBound + 1];
int[] rightArray = new int[upperBound - middle];
Array.Copy(inputItems, lowerBound, leftArray, 0, middle - lowerBound + 1);
Array.Copy(inputItems, middle + 1, rightArray, 0, upperBound - middle);
int i = 0;
int j = 0;
for (int count = lowerBound; count < upperBound + 1; count++) {
if (i == leftArray.Length) {
inputItems[count] = rightArray[j];
j++;
}
else if (j == rightArray.Length) {
inputItems[count] = leftArray[i];
i++;
}
else if (leftArray[i] <= rightArray[j]) {
inputItems[count] = leftArray[i];
i++;
}
else {
inputItems[count] = rightArray[j];
j++;
}
}
}
return inputItems;
}