c# office-interop

Fastest way to read Word files


I'm using the Microsoft Office Interop library to read Word files. I have more than 100 Word files, and reading only 150 paragraphs from all of these files with Interop takes a long time.

Is there a faster library or another way to read them?

  Application word = new Application();

  object fileName = "";
  // Define an object to pass to the API for missing parameters
  object missing = System.Type.Missing;
  Document doc = word.Documents.Open(ref fileName,
          ref missing, ref missing, ref missing, ref missing,
          ref missing, ref missing, ref missing, ref missing,
          ref missing, ref missing, ref missing, ref missing,
          ref missing, ref missing, ref missing);

  List<string> data = new List<string>();
  // Read at most the first 150 paragraphs (Word collections are 1-based)
  int count = Math.Min(150, doc.Paragraphs.Count);
  for (int i = 1; i <= count; i++)
  {
      string temp = doc.Paragraphs[i].Range.Text.Trim();
      if (temp != string.Empty)
          data.Add(temp);
  }

  foreach (var paragraphs in data)
  {
      Console.WriteLine(paragraphs);
  }

  ((_Document)doc).Close();
  ((_Application)word).Quit();

Solution

  • For text-only extraction, you can search for <w:t> elements in the Word file (a .docx file is a zip archive of XML files). Please check this assumption (that the document data is in word/document.xml) with 7-Zip before you rely on it.

    // using System.Collections.Generic;
    // using System.IO.Compression;
    // using System.Xml;
    
    /// <summary>
    /// Returns the text of every &lt;w:t&gt; element (text run) in a Word document.
    /// A paragraph may consist of several runs, so one paragraph can yield several strings.
    /// </summary>
    public IEnumerable<string> ExtractText(string filename)
    {
        // A .docx file is a zip archive of XML files.
        using var zip = ZipFile.OpenRead(filename);
        // The document content is expected in word/document.xml.
        using var stream = zip.GetEntry("word/document.xml")?.Open();
        if (stream == null) { yield break; }
        using var reader = XmlReader.Create(stream);
        while (!reader.EOF)
        {
            // Search for <w:t> values in document.xml
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "t")
            {
                // This call already moves past </w:t>, so do not call Read() as well.
                yield return reader.ReadElementContentAsString();
            }
            else
            {
                reader.Read();
            }
        }
    }
    

    Usage:

    foreach (var text in ExtractText("test.docx"))
    {
        Console.WriteLine("READ A TEXT RUN");
        Console.WriteLine(text);
    }
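
Note that ExtractText yields one string per <w:t> element, and a single paragraph is often split into several of them. If you need whole paragraphs, as in the question, one possible approach is to collect the text until the closing </w:p> tag. The following is only a sketch under some assumptions: the class and method names (DocxParagraphs, FromXml, FirstParagraphs) are made up for illustration, the namespace constant is the one used by regular (transitional) .docx files, and tabs, line breaks and paragraphs nested in text boxes are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml;

public static class DocxParagraphs
{
    // WordprocessingML namespace of transitional .docx files (assumption: not a "strict" document).
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Yields one string per non-empty <w:p> element found in the document.xml content.
    public static IEnumerable<string> FromXml(Stream documentXml)
    {
        using var reader = XmlReader.Create(documentXml);
        var current = new StringBuilder();
        var inText = false;
        while (reader.Read())
        {
            bool isW = reader.NamespaceURI == W;
            if (reader.NodeType == XmlNodeType.Element && isW && reader.LocalName == "t")
            {
                // Start collecting text unless the element is self-closing (<w:t/>).
                inText = !reader.IsEmptyElement;
            }
            else if (reader.NodeType == XmlNodeType.EndElement && isW && reader.LocalName == "t")
            {
                inText = false;
            }
            else if (inText && (reader.NodeType == XmlNodeType.Text
                             || reader.NodeType == XmlNodeType.Whitespace
                             || reader.NodeType == XmlNodeType.SignificantWhitespace))
            {
                current.Append(reader.Value);
            }
            else if (reader.NodeType == XmlNodeType.EndElement && isW && reader.LocalName == "p")
            {
                // End of a paragraph: emit the collected text, skipping empty paragraphs.
                if (current.Length > 0) { yield return current.ToString(); }
                current.Clear();
            }
        }
    }

    // Reads at most `max` non-empty paragraphs from a .docx file.
    public static List<string> FirstParagraphs(string filename, int max)
    {
        using var zip = ZipFile.OpenRead(filename);
        var entry = zip.GetEntry("word/document.xml");
        if (entry == null) { return new List<string>(); }
        using var stream = entry.Open();
        // Take(max) stops the XML reader early, so the rest of the file is never parsed.
        return FromXml(stream).Take(max).ToList();
    }
}
```

For the question's case this would be called as DocxParagraphs.FirstParagraphs("test.docx", 150) for each file. Because the enumeration is lazy, only as much of document.xml is parsed as is needed to produce the requested number of paragraphs.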