
C#: get the visible text from an MS Word window


I'm trying to obtain the text shown in an MS Word window in C# using Microsoft.Office.Interop.Word. Note that I don't want the whole document, or even the whole page; just the content the user currently sees.

The following code seems to work with simple documents:

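// Needs references to Microsoft.Office.Interop.Word, UIAutomationClient,
// UIAutomationTypes and WindowsBase (for Rect).
Application word = new Application();
word.Visible = true;
object fileName = @"example.docx";
// Creates a new, visible document that uses example.docx as its template.
word.Documents.Add(ref fileName, Type.Missing, Type.Missing, true);

// Screen rectangle of whichever UI element currently has keyboard focus.
Rect rect = AutomationElement.FocusedElement.Current.BoundingRectangle;

// Ranges at the top-left and bottom-right corners of that rectangle.
Range r1 = word.ActiveWindow.RangeFromPoint((int)rect.Left, (int)rect.Top);
Range r2 = word.ActiveWindow.RangeFromPoint((int)rect.Left + (int)rect.Width, (int)rect.Top + (int)rect.Height);
// Stretch the first range so that it ends where the second one starts.
r1.End = r2.Start;

// Word ends paragraphs with "\r"; convert to "\r\n" for the console.
Console.WriteLine(r1.Text.Replace("\r", "\r\n"));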

However, when the document includes other structures such as headers, only parts of the text are returned.

So, what's the correct way to achieve this?

Thanks a lot!

Updated Code

// Screen rectangle of whichever UI element currently has keyboard focus.
Rect rect = AutomationElement.FocusedElement.Current.BoundingRectangle;

foreach (Range r in word.ActiveDocument.StoryRanges) {
    int left = 0, top = 0, width = 0, height = 0;
    try {
        try {
            // Screen coordinates (in pixels) of this story's range.
            word.ActiveWindow.GetPoint(out left, out top, out width, out height, r);
        } catch {
            // GetPoint failed for this range; fall back to the focused rectangle.
            left = (int)rect.Left;
            top = (int)rect.Top;
            width = (int)rect.Width;
            height = (int)rect.Height;
        }
        Rect newRect = new Rect(left, top, width, height);
        // Only the part of the story that overlaps the focused rectangle is of interest.
        Rect inter = Rect.Intersect(rect, newRect);
        if (inter != Rect.Empty) {
            Range r1 = word.ActiveWindow.RangeFromPoint((int)inter.Left, (int)inter.Top);
            Range r2 = word.ActiveWindow.RangeFromPoint((int)inter.Right, (int)inter.Bottom);
            r.SetRange(r1.Start, r2.Start);

            Console.WriteLine(r.Text.Replace("\r", "\r\n"));
        }
    } catch { }
}

Solution

  • There may be some problems with this:

    • It's not reliable. Do you get consistent results each time? For example, on a simple "=rand()" document, run the program five times in a row without changing the state of Word. When I do this, a different range is printed to the console each time. I would start here: there seems to be something wrong with your logic for getting the ranges. For example, rect.Left returns a different number every time I run it against the same document left untouched on screen.
    • It gets tricky with other stories. Perhaps RangeFromPoint cannot extend across story boundaries. However, let's assume it does. You would still need to enumerate each story, e.g.

    System.Collections.IEnumerator enumerator = r1.Document.StoryRanges.GetEnumerator();
    while (enumerator.MoveNext())
    {
        Range current = (Range)enumerator.Current;
    }

    Have you looked at "How to programmatically extract the text of the currently viewed page of an Office.Interop.Word.Document object"?
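
As a rough sketch of the story enumeration mentioned above: StoryRanges yields only the first range of each story type, so linked stories (for example, headers in later sections) are usually reached by following NextStoryRange. The class name StoryWalker and the method ForEachStory are hypothetical names chosen for illustration. The code assumes a reference to Microsoft.Office.Interop.Word and does not try to solve the "visible part only" problem.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

static class StoryWalker // hypothetical helper, not part of the Word API
{
    // Calls visit once for every story range in the document,
    // following NextStoryRange so that linked stories are included.
    public static void ForEachStory(Word.Document doc, Action<Word.Range> visit)
    {
        foreach (Word.Range first in doc.StoryRanges)
        {
            Word.Range story = first;
            while (story != null)
            {
                visit(story);
                story = story.NextStoryRange; // null after the last linked story
            }
        }
    }
}
```

It could be called like this, with word being the Application instance from the question:

```csharp
StoryWalker.ForEachStory(word.ActiveDocument,
    s => Console.WriteLine(s.StoryType + ": " + s.StoryLength + " characters"));
```

Printing the story type first is a cheap way to check which stories the document contains before trying to intersect each one with the visible rectangle.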