I recently contributed to openpyxl and it was a really enjoyable library to contribute to.
However, at least for me, there were some aspects that slowed down my initial progress, as I had to learn some new tools and concepts etc. Of course, if you are motivated to contribute you will figure these things out, but I could have contributed sooner if someone had curated a list of tips/instructions/FAQs etc.
Therefore, I decided to open an issue for this on openpyxl (after I submitted by contribution - MR 384, to address issue 1003) under issue 1530. While I was planning what to write and in what format, I also thought that it would be good to also address this on stackoverflow, for a few different reasons. Firstly, I always search stackoverflow when I get stuck on something related to programming, so others might do the same. Secondly, It's a good way to test the water and see what people think about this Q&A, before committing it to heptapod/openpyxl. And lastly, it might give some people the motivation to not just consume open source and stackoverflow but to also contribute to them :)
Here is my step by step guide to getting set-up to commit to openpyxl
The beauty of stackoverflow is that, if something is not clear, it's easy to comment and let me know, and I should be able to address it directly here. Thus, the answer should get better over time and provide a simple alternative channel to ask questions about openpyxl contribution.
$ hg config --edit # add these three lines:
[extensions]
evolve =
topic =
open_workbook()
and follow/trace them through the codebase to understand how it is all connected.hg topic topic-name
)$ hg clone https://foss.heptapod.net/openpyxl/openpyxl openpyxl
$ cd openpyxl
$ hg pull # not needed
$ hg up 3.1 # because it's a new feature
$ hg topic workbook-customDocProps
# write your code...
$ hg pull # pull remote changes (seems to behave like git fetch)
$ hg merge -r 3.1 # merge changes from the remote into your topic branch
$ hg addremove # add any new files, and remove any you deleted
$ hg forget TempScripts/** # because I added some temp scripts during investigation, not to be committed
$ hg commit -m ":heavy_plus_sign: implemented wb.custom_doc_props, with tests and docs"
$ hg push https://{username}:{accessToken}@foss.heptapod.net/openpyxl/openpyxl
hg up
is an alias for hg update
or hg checkout
https://{username}:{accessToken}@foss.heptapod.net/openpyxl/openpyxl
) https://heptapod.net/pages/tuto-repo-http-access-token.htmlThe descriptors
folder, contains classes that enable a type system (with classes such as Typed
) and enable serialization and de-serialization to and from XML & string representations of XML (with classes such as Serlializable
with methods to_tree()
and from_tree()
). A type system is important, because the OOXML specification is fairly strict, and since python is a dynamic language, we need a way to enforce types in a scalable way. Also, since an .xlsx or .xlsm (post 2007) workbook is essentially just a Zip file with XML files inside it, of course XML serialization and de-serialization is critical.
The workbook
folder, contains the Workbook
class that is the python representation of a workbook. The properties of this class, correspond to the properties of the real workbook
The packaging
folder, contains classes that are used to store the workbook document properties (such as author, time created, etc.)
The reader
folder, contains classes that are used to read workbooks from disk, into memory (into a Workbook
class)
The writer
folder, contains classes that are used to write the in-memory workbook (in a Workbook
class) to disk
There are many more folders and classes, but until I work on those areas, I won't discuss them. I will, however, come back and update this, once I do.
Folder structure of the library, after you clone it locally:
(Speaking only about the post 2007/10 .xlsx and .xlsm varieties)
The workbook structure is below, for a fairly simple workbook. Usually, the root folder contains only one file [Content_Types].xml
and this is also referred to as the "manifest". This contains a list of default types and override types. Essentially, as you add more and more features to the Workbook, this list of types expands. For example, I implemented the support for Workbook.CustomDocumentProperties, and these are saved in docProps/custom.xml
and as soon as you add one CustomDocumentProperty, Excel adds this file, but also adds a type override to the manifest, to let Excel know, what type of XML file this is:
<?xml version="1.0"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Default Extension="bin" ContentType="application/vnd.ms-office.vbaProject"/>
<Override PartName="/xl/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/>
<Override PartName="/xl/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
<Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
<Override PartName="/xl/workbook.xml" ContentType="application/vnd.ms-excel.sheet.macroEnabled.main+xml"/>
</Types>
Additionally, in the root folder, there are usually 3 subfolders, _rels
, docProps
and xl
. The folder _rels
is often called the "root rels" or "root relationships". This folder contains a file _rels/.rels
which is an XML relationship file for the Document as a whole. These "rels" files (or relationship files) feature throughout the OOXML format, and in short, they explain how one XML file might be linked to another XML file or binary file. For example, in this simple .xslm file, we have some CustomDocumentProperties (optional) as well as a the default properties ("docProps/core.xml" and "docProps/app.xml") and the root rels, looks like this:
<?xml version="1.0"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Target="xl/workbook.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Id="rId1"/>
<Relationship Target="docProps/core.xml" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Id="rId2"/>
<Relationship Target="docProps/app.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Id="rId3"/>
<Relationship Target="docProps/custom.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties" Id="rId4"/>
</Relationships>
You see in the above XML, that the Workbook itself is stored in "xl/workbook.xml", so you can think of an Excel file such as Document > Workbook > Worksheets
in terms of hierarchy. Every Workbook needs at least one Worksheet, as we have here. Notice also that Themes also fit into the hierarchy such as Document > Workbook > Themes
as seen in the folder structure below.
Equally, you can infer a hierarchy such as Document > Styles
and Document > vbaProject
, which means they are at the same hierarchy level as the Workbook
.
Diving deeper into the structure, you see again the _rels
folder, this time one level deeper at xl/_rels/workbook.xml.rels
and this means it is the relationships for the Workbook.xml file:
<?xml version="1.0"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="/xl/worksheets/sheet1.xml" Id="rId1"/>
<Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml" Id="rId2"/>
<Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml" Id="rId3"/>
<Relationship Type="http://schemas.microsoft.com/office/2006/relationships/vbaProject" Target="vbaProject.bin" Id="rId4"/>
</Relationships>
The xl/workbook.xml
file itself is as such (and even though the styles, theme and vbaProject are not listed here, the relationships shown above are important, so that Excel knows these files exist):
<?xml version="1.0"?>
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<workbookPr codeName="ThisWorkbook"/>
<bookViews>
<workbookView visibility="visible" minimized="0" showHorizontalScroll="1" showVerticalScroll="1" showSheetTabs="1" xWindow="-120" yWindow="-120" windowWidth="29040" windowHeight="15840" tabRatio="600" firstSheet="0" activeTab="0" autoFilterDateGrouping="1"/>
</bookViews>
<sheets>
<sheet xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" name="Sheet1" sheetId="1" state="visible" r:id="rId1"/>
</sheets>
<definedNames/>
<calcPr calcId="191029" fullCalcOnLoad="1"/>
</workbook>
Finally, looking into the xl/worksheets/sheet1.xml
we see how the sheet data is saved. Here we just have the word "test" saved in cell B1:
<?xml version="1.0"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<sheetPr codeName="Sheet1">
<outlinePr summaryBelow="1" summaryRight="1"/>
<pageSetUpPr/>
</sheetPr>
<dimension ref="B1:B1"/>
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="B3" sqref="B3"/>
</sheetView>
</sheetViews>
<sheetFormatPr baseColWidth="8" defaultRowHeight="15"/>
<sheetData>
<row r="1">
<c r="B1" s="1" t="inlineStr"><is><t>Test</t></is></c>
</row>
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
Currently, Sheet1 has no relationships, but as we add more features into the sheet, this relationship file would be created at xl/worksheets/_rels/sheet1.xml.rels
and inform us, what extra data (as XML files or sometimes .bin files) is related to sheet1.
If you rename an excel file such as outfile.xlsm
to outfile.zip
, you can then extract all files and see the structure of the OOXML document: