c# .net parsing docx doc

.NET program to parse a .doc file


I want to create an application that can parse doc/docx files. The structure of such a file is shown below:

par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content

The content can span multiple lines and is not regular. What I want to do is put this content into a database: for the first record, par-000.01 goes into the code column and "some content" goes into the text column.
The reason I cannot do this manually is that I have about 15 docs, each containing about 10 pages of paragraphs that I want to put into my database.
I cannot find any article on how to parse a whole doc file, so I believe it might be possible if I write a proper regular expression. Can anyone point me to an article on how to do this? I can't find anything that suits me; I am probably using the wrong keywords.


Solution

  • You say you have a reasonable amount of data: 15 docs * 10 pages/doc * ~100 lines/page = 15,000 lines, which is manageable in a Word document. You also did not say that this is a repeating data feed, so I assume it is a one-time conversion. In that case I would use an editor that supports global find and replace and convert the text to comma-separated values (CSV) format. Most databases I know can load a CSV file.

    I know you asked for a C# app, but that is overkill in time and effort for your problem.

    So

    1. Convert '<start of line>' to '<start of line>"'
      for MS Word with Find and replace
      find: ^p
      replace: ^&"

    2. Convert ' - ' to '","'
      for MS Word with Find and replace
      find: ' - ' Note: type space, hyphen, space; don't type the tick marks.
      replace: ","

    3. Convert '<end of line>' to '"<end of line>'
      for MS Word with Find and replace
      find: ^p
      replace: "^&

    4. Manually fix up start of first line and end of last line.

    You should get:

    "par-000.01","some content"
    "par-000.21","some content"

    Now just load that into the DB using its CSV import.
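If the find-and-replace steps get tedious, the same conversion can be sketched in a few lines of C#. This is only a rough sketch under some assumptions: each record sits on a single line, the first ' - ' on the line separates the code from the content, and embedded double quotes are doubled, which is a common CSV convention that your database may or may not expect. The names CsvSketch, Quote and TryToCsvLine are made up for illustration.

```csharp
using System;

public static class CsvSketch
{
    // Wraps a field in double quotes and doubles any embedded quotes
    // (assumed CSV convention; check what your database's loader expects).
    public static string Quote(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Splits a line such as "par-000.01 - some content" at the first " - "
    // and builds one CSV line from the two parts.
    // Returns false (and an empty csv) when the line has no " - " separator.
    public static bool TryToCsvLine(string line, out string csv)
    {
        csv = "";
        int i = line.IndexOf(" - ", StringComparison.Ordinal);
        if (i < 0)
        {
            return false;
        }
        string code = line.Substring(0, i).Trim();
        string text = line.Substring(i + 3).Trim();
        csv = Quote(code) + "," + Quote(text);
        return true;
    }
}
```

Calling CsvSketch.TryToCsvLine on "par-000.01 - some content" gives "par-000.01","some content", the same shape as the manual result. Content that spans several lines is not handled here and would still need a fix-up.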

    Also, if you insist on doing this with C#, realize that you can probably save the document as a plain *.txt file, without any of the Word markup, and it will be much easier to take apart with a C# app. Don't get fixated on the Word tags; just sidestep the problem with some creative thinking.
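For the C# route, here is a minimal sketch of taking the saved text apart. It assumes every record starts on its own line with a code shaped like par-000.01 (the pattern is guessed from the sample in the question), followed by ' - ', and that any line not starting with such a code continues the previous record. ParagraphParser and Parse are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphParser
{
    // Assumed record start: "par-", three digits, a dot, two digits, then " - ".
    private static readonly Regex EntryStart =
        new Regex(@"^(par-\d{3}\.\d{2})\s+-\s+(.*)$");

    // Turns the lines of the exported text into (Code, Text) records.
    public static List<(string Code, string Text)> Parse(IEnumerable<string> lines)
    {
        var records = new List<(string Code, string Text)>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0)
            {
                continue; // skip blank lines
            }

            Match m = EntryStart.Match(line);
            if (m.Success)
            {
                // A new record: group 1 is the code, group 2 the first line of content.
                records.Add((m.Groups[1].Value, m.Groups[2].Value));
            }
            else if (records.Count > 0)
            {
                // A continuation line: append it to the most recent record.
                var last = records[records.Count - 1];
                records[records.Count - 1] = (last.Code, last.Text + "\n" + line);
            }
            // Lines that appear before the first code are ignored.
        }
        return records;
    }
}
```

You could feed it with something like ParagraphParser.Parse(File.ReadLines("export.txt")), where export.txt is a placeholder for whatever name you saved the text under, and then insert each Code and Text pair into your table.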