xpath docx openxml wordprocessingml

How to grab text from a Word (docx) document in C#?


I'm trying to get the plain text from a Word document. Specifically, the XPath is giving me trouble. How do you select the <w:t> tags? Here's the code I have.

public static string TextDump(Package package)
{
    StringBuilder builder = new StringBuilder();

    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream());

    foreach (XmlNode node in xmlDoc.SelectNodes("/descendant::w:t"))
    {
        builder.AppendLine(node.InnerText);
    }
    return builder.ToString();
}

Solution

  • Your problem is the XML namespaces. SelectNodes doesn't know how to translate the w prefix in <w:t/> to the full namespace URI. You therefore need to use the overload that takes an XmlNamespaceManager as the second argument. I modified your code a bit, and it seems to work:

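        using System;
        using System.IO;
        using System.IO.Packaging;
        using System.Text;
        using System.Xml;

        public static string TextDump(Package package)
        {
            StringBuilder builder = new StringBuilder();

            XmlDocument xmlDoc = new XmlDocument();
            // Dispose the part stream once the XML has been loaded.
            using (Stream stream = package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream())
            {
                xmlDoc.Load(stream);
            }

            // Map the "w" prefix used in the XPath to the WordprocessingML namespace.
            XmlNamespaceManager mgr = new XmlNamespaceManager(xmlDoc.NameTable);
            mgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            // Every w:t element holds a piece of the document's text.
            foreach (XmlNode node in xmlDoc.SelectNodes("/descendant::w:t", mgr))
            {
                builder.AppendLine(node.InnerText);
            }
            return builder.ToString();
        }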
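
    One side effect of appending a line per <w:t/> is that a paragraph split into several runs comes out on several lines. As a rough sketch of an alternative (not part of the code above), the same idea can be written with LINQ to XML, where an XNamespace replaces the XmlNamespaceManager and the text is grouped by <w:p/> paragraph. The class name DocxText and both overloads are hypothetical, and the sketch ignores tabs, line breaks, and nested content such as text boxes:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Xml.Linq;

    // Hypothetical helper class, shown only for illustration.
    public static class DocxText
    {
        // Same namespace URI that is registered for the "w" prefix above.
        private static readonly XNamespace W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // Returns one line per w:p paragraph, joining the w:t pieces inside it.
        public static string TextDump(XDocument doc)
        {
            IEnumerable<string> paragraphs = doc
                .Descendants(W + "p")
                .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
            return string.Join(Environment.NewLine, paragraphs);
        }

        // Convenience overload for the stream of the /word/document.xml part.
        public static string TextDump(Stream documentXml)
        {
            return TextDump(XDocument.Load(documentXml));
        }
    }
    ```

    Assuming the same Package as above, the stream returned by package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream() can be passed to the second overload.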