c# xml zip docx

How to target a file using a different extension?


I would like to open a .docx file as if it were a .zip, to gain access to its .xml.

I have tried the MSDN ZipFile example, by giving it the path to the .docx file, but it doesn't seem to produce a file.

I am trying to get the word count, which is stored in the docProps\app.xml file inside the archive.

For example, here, instead of "file.zip", I would like to put "file.docx". A .docx is openable as a .zip if you change its extension (it contains various XML files and others), but I don't know how to do this extension change directly in the program, or even if it's possible.

using System;
using System.IO.Compression;

class Program
{
    static void Main(string[] args)
    {
        string zipPath = @".\file.zip";
        string extractPath = @".\extract";

        ZipFile.ExtractToDirectory(zipPath, extractPath);
    }
}

Solution

  • You can obtain the word count very easily using the DocumentFormat.OpenXml library from NuGet.

    using DocumentFormat.OpenXml.Packaging;
    
    var filePath = @"C:\MyDocument.docx";
    
    int wordCount = 0;
    using (WordprocessingDocument doc = WordprocessingDocument.Open(filePath, false))
    {
        int.TryParse(doc.ExtendedFilePropertiesPart?.Properties?.Words?.Text, out wordCount);
    }
    

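    If you would rather not use this library, you can read the archive directly with System.IO.Compression. ZipFile does not check the file extension, so the .docx path can be passed as-is without renaming it to .zip.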

    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Xml;
    
    //Open the .docx file as an archive and copy the relevant XML file to a memory stream.
    int wordCount = 0;
    Stream ms = new MemoryStream();
    using (var zip = ZipFile.Open(filePath, ZipArchiveMode.Read))
    {
        var xmlFile = zip.Entries.FirstOrDefault(entry => entry.FullName == "docProps/app.xml");
     
        //If the entry is missing, stop here and leave wordCount at 0. This is unlikely for a valid .docx, but possible.
        if (xmlFile == null)
            return;
     
        //Dispose the entry stream once its contents have been copied.
        using (var xmlStream = xmlFile.Open())
            xmlStream.CopyTo(ms);
    }
     
    //Now open that XML file and use XPath to select the word count node.
    ms.Seek(0, SeekOrigin.Begin);
    var xml = new XmlDocument();
    xml.Load(ms);
     
    var xmlNsMgr = new XmlNamespaceManager(xml.NameTable);
    xmlNsMgr.AddNamespace("ep", "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties");
     
    var wordNode = xml.SelectSingleNode("//ep:Words", xmlNsMgr);
    
    if (wordNode != null)
        int.TryParse(wordNode.InnerText, out wordCount);
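
To come back to the original question: the extraction call from the question should also work unchanged on a .docx, because ZipFile reads the archive contents and ignores the file name. The sketch below shows this under a few assumptions: `DocxExtractor` and `ExtractAppXml` are hypothetical names used only for illustration, and the target directory is assumed not to contain the extracted files yet. Note that a relative path such as `.\extract` is resolved against the process's current working directory, which may be why no output appeared where it was expected.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
static class DocxExtractor
{
    // Extracts every part of the .docx package into targetDir and
    // returns the expected path of docProps/app.xml inside that directory.
    public static string ExtractAppXml(string docxPath, string targetDir)
    {
        // No renaming needed: the extension of docxPath is not inspected.
        // This call throws an IOException if a file to be extracted already exists.
        ZipFile.ExtractToDirectory(docxPath, targetDir);

        // Build the path with Path.Combine so the separator matches the platform.
        return Path.Combine(targetDir, "docProps", "app.xml");
    }
}
```

Passing absolute paths, for example `DocxExtractor.ExtractAppXml(@"C:\MyDocument.docx", @"C:\extract")`, avoids any ambiguity about where the files end up. If only the word count is needed, reading the single entry as in the answer above avoids writing anything to disk.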