
How to solve the error of Word opening in background when trying to read text from Word documents?


I'm trying to read the text of Word documents into a List<string> and then search that text for a word. The problem is that Word keeps running in the Windows background after each document is opened, even though I close the document after reading the text.

Parallel.ForEach(files, file =>
{
    switch (System.IO.Path.GetExtension(file))
    {
        case ".docx":
            List<string> Word_list = GetTextFromWord(file);
            SearchForWordContent(Word_list, file);
            break;
    }
});

static List<string> GetTextFromWord(string direct)
{
    if (string.IsNullOrEmpty(direct))
    {
        throw new ArgumentNullException("direct");
    }

    if (!File.Exists(direct))
    {
        throw new FileNotFoundException("direct");
    }

    List<string> word_List = new List<string>();
    try
    {
        Microsoft.Office.Interop.Word.Application app =
            new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);

        int count = doc.Words.Count;

        for (int i = 1; i <= count; i++)
        {
            word_List.Add(doc.Words[i].Text);
        }

        ((_Application)app).Quit();
    }
    catch (System.Runtime.InteropServices.COMException e)
    {
        Console.WriteLine("Error: " + e.Message.ToString());
    }
    return word_List;
}

Solution

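  • When you use Word Interop you're actually starting the Word application and talking to it through COM. Every call, even reading a property, is an expensive cross-process call. In the question's code each file also starts its own Word process, and Quit() is skipped whenever an exception is thrown before it, which leaves that process running in the background.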

    You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You can read those files as XML directly, use the Open XML SDK, or use a library like NPOI, which simplifies working with Open XML.
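    As a rough illustration of the "read the XML directly" option, here is a minimal sketch that uses only System.IO.Compression and LINQ to XML. The DocxText class and ReadParagraphs method are hypothetical names made up for this example. It assumes the main document part is stored at word/document.xml, which is the usual location but is not guaranteed by the format, and it ignores headers, footers and footnotes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class DocxText
{
    // WordprocessingML main namespace used by the document part.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens the docx as a ZIP archive and reads the main document part.
    // Assumption: the part lives at word/document.xml.
    public static List<string> ReadParagraphs(string path)
    {
        using (ZipArchive archive = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found in " + path);
            }

            using (Stream stream = entry.Open())
            {
                return ReadParagraphs(stream);
            }
        }
    }

    // Returns the text of every non-empty paragraph (w:p) by joining
    // the text runs (w:t) it contains.
    public static List<string> ReadParagraphs(Stream documentXml)
    {
        XDocument xml = XDocument.Load(documentXml);
        return xml.Descendants(W + "p")
                  .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                  .Where(text => text.Length > 0)
                  .ToList();
    }
}
```

    No Word process is started this way, so nothing is left running in the background and the method can be called from inside Parallel.ForEach.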

    The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part:

    // requires: using DocumentFormat.OpenXml.Packaging;
    using (var document = WordprocessingDocument.Open(fileName, false))
    {
      // Words.Text is a string, so it has to be parsed rather than cast
      var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
    }
    

    You'll find the Open XML documentation, including the structure of Word documents, at MSDN.

    Avoiding Owner Files

    Word or Excel files whose names start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing, and they contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permissions, e.g. in a shared folder.

    To avoid these one only needs to check whether the filename starts with ~.

    • If fileName is only the file name and extension, fileName.StartsWith("~") is enough
    • If fileName is an absolute path, use Path.GetFileName(fileName).StartsWith("~")
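    Both cases can be folded into one check, because Path.GetFileName returns a bare file name unchanged. A small sketch follows; the OwnerFiles class and IsOwnerFile method are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library.
public static class OwnerFiles
{
    // True when the file name itself (not a parent folder) starts with "~".
    // Works for a bare file name as well as for a relative or absolute path.
    public static bool IsOwnerFile(string path)
    {
        string name = Path.GetFileName(path);
        return name != null && name.StartsWith("~", StringComparison.Ordinal);
    }
}
```

    The ordinal comparison keeps the check independent of the current culture.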

    Things get trickier when trying to filter such files in a folder. The search patterns accepted by Directory.EnumerateFiles or DirectoryInfo.EnumerateFiles are simplistic and have no way to exclude names. The files have to be filtered after the call to EnumerateFiles, e.g.:

    var dir=new DirectoryInfo(folderPath);
    
    foreach(var file in dir.EnumerateFiles("*.docx"))
    {
        if (!file.Name.StartsWith("~"))
        {
            ...
        }
    }
    

    or, using LINQ:

    var dir=new DirectoryInfo(folderPath);
    var files=dir.EnumerateFiles("*.docx")
                 .Where(file=>!file.Name.StartsWith("~"));
    foreach(var file in files)
    {
        ...
    }
    

    Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property can be set so that files which can't be accessed are skipped:

    var dir=new DirectoryInfo(folderPath);
    var options=new EnumerationOptions 
                { 
                    IgnoreInaccessible =true
                };
    var files=dir.EnumerateFiles("*.docx",options)
                 .Where(file=>!file.Name.StartsWith("~"));
    
    
