Tags: c#, .net, c#-4.0, ms-office, office-interop

How to find highlighted text from Word file in C# using Microsoft.Office.Interop.Word?


The question would be simple, but one extra requirement has turned it into a big headache: I do not need every highlighted word individually, I need the highlighted phrases from the Word file. I have written the following code:

using Word = Microsoft.Office.Interop.Word;

private void button1_Click(object sender, EventArgs e)
{
    try
    {
        Word.ApplicationClass wordObject = new Word.ApplicationClass();
        wordObject.Visible = false;
        object file = "D:\\mywordfile.docx";
        object nullobject = System.Reflection.Missing.Value;
        Word.Document thisDoc = wordObject.Documents.Open(ref file, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject);
        List<string> wordHighlights = new List<string>();

        //Let myRange be some Range which has my text under consideration

        int prevStart = 0;
        int prevEnd = 0;
        int thisStart = 0;
        int thisEnd = 0;
        string tempStr = "";
        foreach (Word.Range cellWordRange in myRange.Words)
        {
            if (cellWordRange.HighlightColorIndex == Word.WdColorIndex.wdNoHighlight)   // Compare the enum directly instead of its string name
            {
                continue;
            }
            else
            {
                thisStart = cellWordRange.Start;
                thisEnd = cellWordRange.End;
                string cellWordText = cellWordRange.Text.Trim();
                if (cellWordText.Length >= 1)   // valid word length, non-whitespace
                {
                    if (thisStart == prevEnd)    // If this word is contiguously highlighted with previous highlighted word
                    {
                        tempStr = String.Concat(tempStr, " "+cellWordText);  // Concatenate with previous contiguously highlighted word
                    }
                    else
                    {
                        if (tempStr.Length > 0)    // If some string has been concatenated in previous iterations
                        {
                            wordHighlights.Add(tempStr);
                        }
                        tempStr = cellWordText;    // Start a new phrase with this word
                    }
                }
                prevStart = thisStart;
                prevEnd = thisEnd;
            }
        }
        if (tempStr.Length > 0)    // Flush the last phrase, which the loop above never adds
        {
            wordHighlights.Add(tempStr);
        }

        foreach (string highlightedString in wordHighlights)
        {
            MessageBox.Show(highlightedString);
        }
    }
    catch (Exception j)
    {
        MessageBox.Show(j.Message);
    }
}

Now consider the following text:

Le thé vert a un rôle dans la diminution du cholestérol, la combustion des graisses, la prévention du diabète et les AVC, et conjurer la démence.

Now suppose someone highlighted "du cholestérol". My code returns two separate words, "du" and "cholestérol". How can I make a contiguously highlighted area come back as a single entry? That is, "du cholestérol" should be returned as one item in the List. Is there a way to scan the document character by character, marking the point where highlighting starts as the start of the selection and the point where it stops as the end?

P.S.: If a library in any other language offers this capability, please let me know; the scenario is not language specific. I only need to get the desired result somehow.

EDIT: I modified the code to use Start and End as suggested by Oliver Hanappi. One problem remains: if two highlighted phrases are separated only by white space, the program treats them as a single phrase, because it reads Words and not the spaces between them. Maybe some changes are required around if (thisStart == prevEnd)?


Solution

  • You can do this far more efficiently with Find, which searches more quickly and selects all the contiguous text that matches. See the reference here: http://msdn.microsoft.com/en-us/library/office/bb258967%28v=office.12%29.aspx

    Here is an example in VBA which prints all occurrences of highlighted text:

    Sub TestFind()
      Dim myRange As Range

      Set myRange = ActiveDocument.Content    ' search entire document

      With myRange.Find
        .Highlight = True
        Do While .Execute = True     ' loop while highlighted text is found
          Debug.Print myRange.Text   ' myRange is changed to contain the found text
        Loop
      End With
    End Sub

    Hope this helps you understand better.
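
    Since the question uses C#, the same Find loop can probably be written against Microsoft.Office.Interop.Word as sketched below. This is a rough translation under stated assumptions, not tested code: the HighlightFinder class and GetHighlightedPhrases method are hypothetical names, the document is assumed to be already open, and calling Execute() with no arguments assumes C# 4.0 or later, where optional COM parameters can be omitted. The lastEnd guard is an added precaution in case Find keeps returning the same range.

    ```csharp
    using System.Collections.Generic;
    using Word = Microsoft.Office.Interop.Word;

    static class HighlightFinder   // hypothetical helper, not part of the interop library
    {
        // Returns each contiguous highlighted run in the document as one string.
        public static List<string> GetHighlightedPhrases(Word.Document doc)
        {
            var phrases = new List<string>();
            Word.Range range = doc.Content;          // search the entire document
            Word.Find find = range.Find;

            find.ClearFormatting();
            find.Text = "";                          // match on formatting only
            find.Format = true;
            find.Highlight = 1;                      // the interop property is an int; nonzero means "highlighted"
            find.Forward = true;
            find.Wrap = Word.WdFindWrap.wdFindStop;  // stop at the end instead of wrapping around

            int lastEnd = -1;
            while (find.Execute())                   // each hit moves 'range' onto the found text
            {
                if (range.End <= lastEnd)
                {
                    break;                           // no forward progress, so avoid looping forever
                }
                lastEnd = range.End;
                phrases.Add(range.Text);
            }
            return phrases;
        }
    }
    ```

    In the question's code this would replace the whole foreach over myRange.Words, for example: List<string> wordHighlights = HighlightFinder.GetHighlightedPhrases(thisDoc); Because Find matches runs of highlighted text rather than words, two phrases separated by an unhighlighted space should come back as two entries.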