I'm trying to read the text from Word documents into a List<string> and then search that text for a word. The problem, however, is that the Word instances keep running in the Windows background after the documents are opened, even though I close each document after reading the text.
Parallel.ForEach(files, file =>
{
    switch (System.IO.Path.GetExtension(file))
    {
        case ".docx":
            List<string> Word_list = GetTextFromWord(file);
            SearchForWordContent(Word_list, file);
            break;
    }
});
static List<string> GetTextFromWord(string direct)
{
    if (string.IsNullOrEmpty(direct))
    {
        throw new ArgumentNullException("direct");
    }
    if (!File.Exists(direct))
    {
        throw new FileNotFoundException("direct");
    }
    List<string> word_List = new List<string>();
    try
    {
        Microsoft.Office.Interop.Word.Application app =
            new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);
        int count = doc.Words.Count;
        for (int i = 1; i <= count; i++)
        {
            word_List.Add(doc.Words[i].Text);
        }
        ((_Application)app).Quit();
    }
    catch (System.Runtime.InteropServices.COMException e)
    {
        Console.WriteLine("Error: " + e.Message.ToString());
    }
    return word_List;
}
When you use Word Interop you're actually starting the Word application and talking to it using COM. Every call, even reading a property, is an expensive cross-process call.
You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You could read those files as XML directly, use the Open XML SDK to read a docx file, or use a library like NPOI, which simplifies working with Open XML.
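As a rough sketch of the first option, reading the XML directly, the snippet below opens the package with System.IO.Compression and collects the text of every paragraph. The class and method names are made up for illustration. It assumes the main part is stored at the conventional word/document.xml path, which a package isn't strictly required to use, and it ignores headers, footers and footnotes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class DocxText
{
    // WordprocessingML main namespace; paragraphs are <w:p>, text runs are <w:t>.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph in the main document part.
    public static List<string> ReadParagraphs(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            // Assumption: the main part lives at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("No word/document.xml part in " + path);
            }
            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);
                // Join the text runs of each paragraph into one string.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(
                              p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }
}
```

Something like DocxText.ReadParagraphs(file) could then stand in for GetTextFromWord(file) in the Parallel.ForEach loop. The list holds paragraphs rather than single words, so the search would need to look inside each string, but no Word process is started at all.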
The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part:
using (var document = WordprocessingDocument.Open(fileName, false))
{
    // Words stores the count as text, so it has to be parsed; a cast to int won't compile.
    var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
}
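To show where that property actually lives, here is a minimal sketch that reads the same value without the SDK. It assumes the extended properties are stored at the conventional docProps/app.xml path, and the class and method names are invented for the example. Keep in mind that the number is whatever the last application to save the file wrote there, so it can be missing or out of date.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class DocxStats
{
    // Namespace of the extended (application) properties part.
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Returns the stored word count, or null when the part or element is absent.
    public static int? ReadStoredWordCount(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            // Assumption: the properties part lives at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("docProps/app.xml");
            if (entry == null)
            {
                return null;
            }
            using (Stream stream = entry.Open())
            {
                XElement root = XDocument.Load(stream).Root;
                XElement words = root == null ? null : root.Element(Ep + "Words");
                int count;
                if (words != null && int.TryParse(words.Value, out count))
                {
                    return count;
                }
                return null;
            }
        }
    }
}
```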
You'll find the Open XML documentation, including the structure of Word documents, at MSDN.
Avoiding Owner Files
Word or Excel files that start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing and contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permissions, e.g. in a shared folder.
To avoid these, one only needs to check whether the file name starts with ~.
If fileName is only the file name and extension, fileName.StartsWith("~") is enough.
If fileName is an absolute path, use Path.GetFileName(fileName).StartsWith("~").
Things get trickier when trying to filter such files in a folder. The patterns used in Directory.EnumerateFiles or DirectoryInfo.EnumerateFiles are simplistic and can't exclude characters. The files will have to be filtered after the call to EnumerateFiles, e.g.:
var dir = new DirectoryInfo(folderPath);
foreach (var file in dir.EnumerateFiles("*.docx"))
{
    if (!file.Name.StartsWith("~"))
    {
        ...
    }
}
or, using LINQ:
var dir = new DirectoryInfo(folderPath);
var files = dir.EnumerateFiles("*.docx")
               .Where(file => !file.Name.StartsWith("~"));
foreach (var file in files)
{
    ...
}
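When the code works with plain string paths instead of FileInfo objects, the two cases described earlier (bare file name or absolute path) can be folded into one check, because Path.GetFileName returns a bare file name unchanged. The following is only a sketch; OwnerFileFilter, IsOwnerFile and RealDocuments are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of any library.
static class OwnerFileFilter
{
    // True when the file-name portion starts with '~', whether the
    // argument is a bare file name or a full path.
    public static bool IsOwnerFile(string pathOrName)
    {
        return Path.GetFileName(pathOrName)
                   .StartsWith("~", StringComparison.Ordinal);
    }

    // Lazily enumerates the .docx files in a folder, skipping owner files.
    public static IEnumerable<string> RealDocuments(string folderPath)
    {
        return Directory.EnumerateFiles(folderPath, "*.docx")
                        .Where(path => !IsOwnerFile(path));
    }
}
```

The ordinal comparison is a deliberate choice here: the culture-sensitive default of StartsWith(string) isn't needed for a single punctuation character.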
Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property can be used to skip over locked files:
var dir = new DirectoryInfo(folderPath);
var options = new EnumerationOptions
{
    IgnoreInaccessible = true
};
var files = dir.EnumerateFiles("*.docx", options)
               .Where(file => !file.Name.StartsWith("~"));
One option is to