Search code examples
pythonmercurialopenpyxl

How can I contribute to openpyxl on https://foss.heptapod.net/openpyxl/


I recently contributed to and it was a really enjoyable library to contribute to.

However, at least for me, there were some aspects that slowed down my initial progress, as I had to learn some new tools and concepts etc. Of course, if you are motivated to contribute you will figure these things out, but I could have contributed sooner if someone had curated a list of tips/instructions/FAQs etc.

Therefore, I decided to open an issue for this on openpyxl (after I submitted by contribution - MR 384, to address issue 1003) under issue 1530. While I was planning what to write and in what format, I also thought that it would be good to also address this on stackoverflow, for a few different reasons. Firstly, I always search stackoverflow when I get stuck on something related to programming, so others might do the same. Secondly, It's a good way to test the water and see what people think about this Q&A, before committing it to heptapod/openpyxl. And lastly, it might give some people the motivation to not just consume open source and stackoverflow but to also contribute to them :)

Here is my step by step guide to getting set-up to commit to

The beauty of stackoverflow is that, if something is not clear, it's easy to comment and let me know, and I should be able to address it directly here. Thus, the answer should get better over time and provide a simple alternative channel to ask questions about contribution.


Solution

  • Step by step (more detail on each aspect below):

    • make an account on https://foss.heptapod.net/
    • download & install mercurial https://www.mercurial-scm.org/wiki/Download and hg-evolve https://pypi.org/project/hg-evolve/ (I noticed that the mercurial client [TortoiseHg], seems to default to using python27. This meant that when I installed hg-evolve on my python39 it did not seem to work but when I installed it on my python27 as well then the extension started to work with mercurial)
      • ensure to do this as well to register the hg-evolve extensions:
      $ hg config --edit # add these three lines:
      [extensions]
      evolve =
      topic = 
      
    • clone the openpyxl repo
    • request developer access
    • read section 8.13 of Python Cookbook 3rd Edition (Type checking classes, mixins, etc.)
    • read the codebase, try to start from the initial methods like open_workbook() and follow/trace them through the codebase to understand how it is all connected.
    • select an issue on https://foss.heptapod.net/openpyxl/openpyxl/-/issues, ideally one with some upvotes/positive comments and make a comment yourself to say you would like to implement/fix this issue.
    • read the OOMXL spec for the part you are looking to work on (might not be needed for some features/bugs/improvements to docs)
    • create a topic for the issue with your local hg (hg topic topic-name)
    • code - implement the feature/fix the bug/update docs etc.
    • write tests (if it's a new feature)
    • write docs (if it's a new feature)
    • hg addremove and hg commit all changes
    • push to foss (this will trigger a PR/MR)
    • wait for a review from a maintainer, answer any questions and address any requests (such as: please add test to cover xyz, try to make the demo in the docs more real-world, make sure you conform to PEP-8 style, etc.)
    • make any further changes, hg addremove, hg commit, hg push
    • MR is accepted - congrats!

    Some more details:

    Related to source control:

    • openpyxl is hosted on https://foss.heptapod.net/ and not github
    • openpyxl uses Mercurial instead of Git and requires hg-evolve to be installed with pip
    • Mercurial/hg uses "topics" instead of branches (but they are essentially the same) and you cannot fork the repo, you have to clone the repo locally and then create a topic for your work and add and commit all code to that topic.
    • Mercurial/hg uses some different commands than git, here is a common workflow (the example is from my contribution):
      $ hg clone https://foss.heptapod.net/openpyxl/openpyxl openpyxl
      $ cd openpyxl
      $ hg pull # not needed
      $ hg up 3.1 # because it's a new feature
      $ hg topic workbook-customDocProps
      # write your code...
      $ hg pull # pull remote changes (seems to behave like git fetch)
      $ hg merge -r 3.1 # merge changes from the remote into your topic branch
      $ hg addremove # add any new files, and remove any you deleted
      $ hg forget TempScripts/** # because I added some temp scripts during investigation, not to be committed
      $ hg commit -m ":heavy_plus_sign: implemented wb.custom_doc_props, with tests and docs"
      $ hg push https://{username}:{accessToken}@foss.heptapod.net/openpyxl/openpyxl
      
      • NOTE: hg up is an alias for hg update or hg checkout
    • There are some good resources for help such as:
    • you will need to create an access token to push to the repo (and you will need to push with the format I showed above such as https://{username}:{accessToken}@foss.heptapod.net/openpyxl/openpyxl) https://heptapod.net/pages/tuto-repo-http-access-token.html

    Related to the library itself:

    • It's quite a large library, with a lot of interconnected parts, so it takes a while to figure it out on your own.
    • Without reading section 8.13 of Python Cookbook 3rd Edition it can be hard to understand what all the classes and inheritance and mixins etc. are achieving in order to Serialize and De-serialize XML.
    • So, in order to try to address the previous 2 bullet points, I've tried to explain the library as best I can, from the parts of it I worked with:
      • The descriptors folder, contains classes that enable a type system (with classes such as Typed) and enable serialization and de-serialization to and from XML & string representations of XML (with classes such as Serlializable with methods to_tree() and from_tree()). A type system is important, because the OOXML specification is fairly strict, and since python is a dynamic language, we need a way to enforce types in a scalable way. Also, since an .xlsx or .xlsm (post 2007) workbook is essentially just a Zip file with XML files inside it, of course XML serialization and de-serialization is critical.

      • The workbook folder, contains the Workbook class that is the python representation of a workbook. The properties of this class, correspond to the properties of the real workbook

      • The packaging folder, contains classes that are used to store the workbook document properties (such as author, time created, etc.)

      • The reader folder, contains classes that are used to read workbooks from disk, into memory (into a Workbook class)

      • The writer folder, contains classes that are used to write the in-memory workbook (in a Workbook class) to disk

      • There are many more folders and classes, but until I work on those areas, I won't discuss them. I will, however, come back and update this, once I do.

        Folder structure of the library, after you clone it locally:

        enter image description here

    Related to Excel files themselves:

    (Speaking only about the post 2007/10 .xlsx and .xlsm varieties)

    • The Specification for the Open XML file format is pretty dry: http://www.ecma-international.org/news/TC45_current_work/Office%20Open%20XML%20Part%201%20-%20Fundamentals_final.docx but it is pretty important and necessary to implement new features correctly, according to the spec (and not by reverse engineering example Excel files you have - although that can help sometimes point you to the correct part of the specification etc.)
    • You need to understand how the manifest, rels, uuids, namespaces etc. all work together to store an excel workbook (with all its worksheets, images, charts etc.) into XML and in some case binary formats.
    • So, in order to try to address the previous 2 bullet points, I've tried to explain the OOXML format as best I can, from the parts of it I worked with:
      • The workbook structure is below, for a fairly simple workbook. Usually, the root folder contains only one file [Content_Types].xml and this is also referred to as the "manifest". This contains a list of default types and override types. Essentially, as you add more and more features to the Workbook, this list of types expands. For example, I implemented the support for Workbook.CustomDocumentProperties, and these are saved in docProps/custom.xml and as soon as you add one CustomDocumentProperty, Excel adds this file, but also adds a type override to the manifest, to let Excel know, what type of XML file this is:

        <?xml version="1.0"?>
        <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
          <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
          <Default Extension="xml" ContentType="application/xml"/>
          <Default Extension="bin" ContentType="application/vnd.ms-office.vbaProject"/>
          <Override PartName="/xl/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/>
          <Override PartName="/xl/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
          <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
          <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
          <Override PartName="docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
          <Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
          <Override PartName="/xl/workbook.xml" ContentType="application/vnd.ms-excel.sheet.macroEnabled.main+xml"/>
        </Types>
        
      • Additionally, in the root folder, there are usually 3 subfolders, _rels, docProps and xl. The folder _rels is often called the "root rels" or "root relationships". This folder contains a file _rels/.rels which is an XML relationship file for the Document as a whole. These "rels" files (or relationship files) feature throughout the OOXML format, and in short, they explain how one XML file might be linked to another XML file or binary file. For example, in this simple .xslm file, we have some CustomDocumentProperties (optional) as well as a the default properties ("docProps/core.xml" and "docProps/app.xml") and the root rels, looks like this:

        <?xml version="1.0"?>
        <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship Target="xl/workbook.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Id="rId1"/>
          <Relationship Target="docProps/core.xml" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Id="rId2"/>
          <Relationship Target="docProps/app.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Id="rId3"/>
          <Relationship Target="docProps/custom.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties" Id="rId4"/>
        </Relationships>
        
      • You see in the above XML, that the Workbook itself is stored in "xl/workbook.xml", so you can think of an Excel file such as Document > Workbook > Worksheets in terms of hierarchy. Every Workbook needs at least one Worksheet, as we have here. Notice also that Themes also fit into the hierarchy such as Document > Workbook > Themes as seen in the folder structure below.

      • Equally, you can infer a hierarchy such as Document > Styles and Document > vbaProject, which means they are at the same hierarchy level as the Workbook.

      • Diving deeper into the structure, you see again the _rels folder, this time one level deeper at xl/_rels/workbook.xml.rels and this means it is the relationships for the Workbook.xml file:

        <?xml version="1.0"?>
        <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="/xl/worksheets/sheet1.xml" Id="rId1"/>
          <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml" Id="rId2"/>
          <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml" Id="rId3"/>
          <Relationship Type="http://schemas.microsoft.com/office/2006/relationships/vbaProject" Target="vbaProject.bin" Id="rId4"/>
        </Relationships>
        
      • The xl/workbook.xml file itself is as such (and even though the styles, theme and vbaProject are not listed here, the relationships shown above are important, so that Excel knows these files exist):

        <?xml version="1.0"?>
        <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
          <workbookPr codeName="ThisWorkbook"/>
          <bookViews>
            <workbookView visibility="visible" minimized="0" showHorizontalScroll="1" showVerticalScroll="1" showSheetTabs="1" xWindow="-120" yWindow="-120" windowWidth="29040" windowHeight="15840" tabRatio="600" firstSheet="0" activeTab="0" autoFilterDateGrouping="1"/>
          </bookViews>
          <sheets>
            <sheet xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" name="Sheet1" sheetId="1" state="visible" r:id="rId1"/>
          </sheets>
          <definedNames/>
          <calcPr calcId="191029" fullCalcOnLoad="1"/>
        </workbook>
        
      • Finally, looking into the xl/worksheets/sheet1.xml we see how the sheet data is saved. Here we just have the word "test" saved in cell B1:

        <?xml version="1.0"?>
        <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
          <sheetPr codeName="Sheet1">
            <outlinePr summaryBelow="1" summaryRight="1"/>
            <pageSetUpPr/>
          </sheetPr>
          <dimension ref="B1:B1"/>
          <sheetViews>
            <sheetView tabSelected="1" workbookViewId="0">
              <selection activeCell="B3" sqref="B3"/>
            </sheetView>
          </sheetViews>
          <sheetFormatPr baseColWidth="8" defaultRowHeight="15"/>
          <sheetData>
            <row r="1">
              <c r="B1" s="1" t="inlineStr"><is><t>Test</t></is></c>
            </row>
          </sheetData>
          <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
        </worksheet>
        
      • Currently, Sheet1 has no relationships, but as we add more features into the sheet, this relationship file would be created at xl/worksheets/_rels/sheet1.xml.rels and inform us, what extra data (as XML files or sometimes .bin files) is related to sheet1.

        If you rename an excel file such as outfile.xlsm to outfile.zip, you can then extract all files and see the structure of the OOXML document:

        enter image description here