c# ms-office office-interop

Parse table using Microsoft.Office.Interop.Word, get only text from first column?


I am working on writing a program that will parse text data from a Microsoft Word 2010 document. Specifically, I want to get text from each cell in the first column of every table in the document.

For reference, the document looks like this: [image]

I only need the text from the first-column cells of each table. I'm going to add this text to an in-memory DataTable.

My code, so far, looks like this:

    private void button1_Click(object sender, EventArgs e)
    {
        // Create an instance of the Open File Dialog Box
        var openFileDialog1 = new OpenFileDialog();

        // Set filter options and filter index
        openFileDialog1.Filter = "Word Documents (.docx)|*.docx|All files (*.*)|*.*";
        openFileDialog1.FilterIndex = 1;
        openFileDialog1.Multiselect = false;

        // Show the dialog box; stop if the user cancels without picking a file.
        if (openFileDialog1.ShowDialog() != DialogResult.OK)
            return;
        txtDocument.Text = openFileDialog1.FileName;

        var word = new Microsoft.Office.Interop.Word.Application();
        object miss = System.Reflection.Missing.Value;
        object path = openFileDialog1.FileName;
        object readOnly = true;
        var docs = word.Documents.Open(ref path, ref miss, ref readOnly, 
                                       ref miss, ref miss, ref miss, ref miss, 
                                       ref miss, ref miss, ref miss, ref miss, 
                                       ref miss, ref miss, ref miss, ref miss, 
                                       ref miss);

        // Datatable to store text from Word doc
        var dt = new System.Data.DataTable();
        dt.Columns.Add("Text");

        // Loop through each table in the document, 
        // grab only text from cells in the first column
        // in each table.
        foreach (Table tb in docs.Tables)
        {
            // insert code here to get text from cells in first column
            // and insert into datatable.
        }

        ((_Document)docs).Close();
        ((_Application)word).Quit();
    }

I'm stuck at the part where I grab the text from each cell and add it to my datatable. Can someone offer me some pointers? I'd sure appreciate it.

Thanks!


Solution

  • I don't know how you want to store the text, but to read it I think you can loop over the rows of each table and take the first cell of each row:

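    foreach (Table tb in docs.Tables) {
        // Word collections are 1-based, so rows run from 1 to Rows.Count.
        for (int row = 1; row <= tb.Rows.Count; row++) {
            // Cell(row, 1) is the cell in the first column of this row.
            var cell = tb.Cell(row, 1);

            // Range.Text ends with an end-of-cell marker ("\r\a"); strip it.
            var text = cell.Range.Text.TrimEnd('\r', '\a');

            // text now contains the content of the cell;
            // add it to the DataTable from the question.
            dt.Rows.Add(text);
        }
    }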
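
If a cell holds more than one paragraph, `Range.Text` also contains `'\r'` between the paragraphs (and `'\v'` for manual line breaks), so it may be worth moving the cleanup into a small helper that can be tested without Word. The following is only a sketch: `CellText`, `Clean` and `ToDataTable` are hypothetical names that do not come from the interop library, and skipping empty cells is an assumption about what the DataTable should contain.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper, not part of Microsoft.Office.Interop.Word.
public static class CellText
{
    // Removes the end-of-cell marker ("\r\a") that Word appends to
    // Cell.Range.Text, then replaces paragraph marks ('\r') and manual
    // line breaks ('\v') inside the cell with spaces.
    public static string Clean(string raw)
    {
        if (string.IsNullOrEmpty(raw))
            return string.Empty;

        string withoutMarker = raw.TrimEnd('\r', '\a');
        return withoutMarker.Replace('\r', ' ').Replace('\v', ' ').Trim();
    }

    // Builds a one-column DataTable ("Text") from raw cell strings.
    // Assumption: cells that are empty after cleaning are skipped.
    public static DataTable ToDataTable(IEnumerable<string> rawCells)
    {
        var dt = new DataTable();
        dt.Columns.Add("Text");

        foreach (string raw in rawCells)
        {
            string text = Clean(raw);
            if (text.Length > 0)
                dt.Rows.Add(text);
        }

        return dt;
    }
}
```

Inside the loop this would be called as `CellText.Clean(cell.Range.Text)` before adding the value to the DataTable.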