I would like to open a .docx file as if it were a .zip, to gain access to its .xml.
I have tried the MSDN ZipFile example, by giving it the path to the .docx file, but it doesn't seem to produce a file.
I am trying to get the word count, which can be found in the docProps\app.xml file.
For example, here, instead of "file.zip", I would like to put "file.docx". A .docx is openable as a .zip if you change its extension (it contains various XML files and others), but I don't know how to do this extension change directly in the program, or even if it's possible.
using System;
using System.IO.Compression;

class Program
{
    static void Main(string[] args)
    {
        string zipPath = @".\file.zip";
        string extractPath = @".\extract";
        ZipFile.ExtractToDirectory(zipPath, extractPath);
    }
}
You can obtain the word count very easily using the DocumentFormat.OpenXml library from NuGet.
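using DocumentFormat.OpenXml.Packaging;

var filePath = @"C:\MyDocument.docx";
int wordCount = 0;

// Open the document read-only (second argument: isEditable = false).
using (WordprocessingDocument doc = WordprocessingDocument.Open(filePath, false))
{
    // The extended properties part corresponds to docProps/app.xml; wordCount stays 0 if it is missing.
    int.TryParse(doc.ExtendedFilePropertiesPart?.Properties?.Words?.Text, out wordCount);
}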
If for whatever reason you do not wish to use this library, then here is an alternative solution.
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml;

var filePath = @"C:\MyDocument.docx";
int wordCount = 0;
var ms = new MemoryStream();

// Open the .docx file as an archive and copy the relevant XML file to a memory stream.
// The file extension does not need to be changed to .zip for this to work.
using (var zip = ZipFile.Open(filePath, ZipArchiveMode.Read))
{
    var xmlFile = zip.Entries.FirstOrDefault(e => e.FullName == "docProps/app.xml");

    // If this XML file is not found, do nothing. I am not sure of a scenario where
    // this would be the case, but you never know.
    if (xmlFile == null)
        return;

    using (var xmlStream = xmlFile.Open())
        xmlStream.CopyTo(ms);
}

// Now load that XML and use XPath to select the word count node.
ms.Seek(0, SeekOrigin.Begin);
var xml = new XmlDocument();
xml.Load(ms);

var xmlNsMgr = new XmlNamespaceManager(xml.NameTable);
xmlNsMgr.AddNamespace("ep", "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties");

var wordNode = xml.SelectSingleNode("//ep:Words", xmlNsMgr);
if (wordNode != null)
    int.TryParse(wordNode.InnerText, out wordCount);
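If you want to reuse this, the same idea can be wrapped in a small helper. The following is only a sketch: the class name DocxWordCount and the method name Read are made up for illustration, and it swaps XmlDocument and XPath for LINQ to XML (XDocument), which lets the entry be parsed straight from the archive stream without the intermediate MemoryStream. It assumes Words is a direct child of the root Properties element, as it is in the app.xml files Word writes.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class DocxWordCount
{
    // Namespace of the elements in docProps/app.xml (extended properties).
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Returns the stored word count, or null if the part or the
    // Words element is missing or does not contain an integer.
    public static int? Read(string docxPath)
    {
        // A .docx is a ZIP archive, so it can be opened without renaming it.
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("docProps/app.xml");
            if (entry == null)
                return null;

            using (Stream stream = entry.Open())
            {
                XDocument doc = XDocument.Load(stream);
                string text = doc.Root?.Element(Ep + "Words")?.Value;
                return int.TryParse(text, out int count) ? count : (int?)null;
            }
        }
    }
}
```

Calling DocxWordCount.Read(@"C:\MyDocument.docx") then gives you an int? that is null when no count could be read. Keep in mind that this is the value the authoring application stored when the file was last saved, so it is not recomputed from the document body.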