I want to create an application that can parse doc/docx files. The structure of such a file is shown below:
par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content
The content can span multiple lines and is not regular. I want to put this content into a database: for the first record, par-000.01 goes into the code column and some content goes into the text column.
The reason I cannot do this manually is that I have about 15 documents, each containing about 10 pages of paragraphs that I want to put into my database.
I cannot find any article on how to parse a whole doc file, but I believe it should be possible if I write the proper regular expression. Can anyone point me to an article on how to do this? I can't find anything that suits me; I am probably using the wrong keywords.
You have a reasonable amount of data: 15 docs * 10 pages/doc * ~100 lines/page = 15,000 lines, which is manageable in a Word document. You also did not say that this is a repeating data feed, so I assume it is a one-time conversion. In that case I would do it in an editor that supports global find and replace, and convert the text to comma-separated values (CSV) format. Most databases I know can load a CSV file.
I know you asked for a C# app, but that is overkill in time and effort for your problem.
So:
Convert '<start of line>'
to '<start of line>"'
for MS Word with Find and replace
find: ^p
replace: ^&"
Convert ' - '
to '","'
for MS Word with Find and replace
find: ' - ' (Note: type space, hyphen, space; do not type the quote marks.)
replace: ","
Convert '<end of line>'
to '"<end of line>'
for MS Word with Find and replace
find: ^p
replace: "^&
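To see what the three passes do, here is a small simulation in Python (not Word itself). It is only a sketch: "\n" stands in for Word's paragraph mark (^p), the sample text is made up from the question's example, and it assumes every record sits on a single line and that ' - ' never occurs inside the content.

```python
# Simulate the three Word find/replace passes on a small sample.
# "\n" stands in for Word's paragraph mark (^p); the sample text is hypothetical.
sample = "par-000.01 - some content\npar-000.21 - some content\n"

step1 = sample.replace("\n", "\n\"")   # find ^p,    replace ^&"  (quote after each mark)
step2 = step1.replace(" - ", "\",\"")  # find ' - ', replace ","
step3 = step2.replace("\n", "\"\n")    # find ^p,    replace "^&  (quote before each mark)

print(step3)
```

In this simulation the first line comes out without its opening quote (no paragraph mark precedes it) and a stray quote is left after the final paragraph mark.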
Manually fix up the start of the first line (no paragraph mark precedes it, so it gets no opening quote) and the end of the last line.
You should get:
"par-000.01","some content"
"par-000.21","some content"
Now just load that into a DB using its CSV load.
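If your database has no convenient CSV import, the load can also be scripted. The following is a minimal sketch using Python's built-in csv and sqlite3 modules; SQLite, the table name paragraphs, and the in-memory sample data are assumptions for illustration, and the code/text column names follow the question.

```python
import csv
import io
import sqlite3

# Stand-in for the CSV file produced by the find/replace steps.
csv_text = '"par-000.01","some content"\n"par-000.21","some content"\n'

conn = sqlite3.connect(":memory:")  # use a file path for a real database
conn.execute("CREATE TABLE paragraphs (code TEXT PRIMARY KEY, text TEXT)")

# csv.reader strips the surrounding quotes and splits on the comma.
rows = list(csv.reader(io.StringIO(csv_text)))
conn.executemany("INSERT INTO paragraphs (code, text) VALUES (?, ?)", rows)
conn.commit()

for code, text in conn.execute("SELECT code, text FROM paragraphs ORDER BY code"):
    print(code, "|", text)
```

For a real file, replace the io.StringIO stand-in with open("paragraphs.csv", newline="") (a hypothetical file name).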
Also, if you insist on doing this with C#, realize that you can probably save the document as a plain *.txt file, without any of the Word markup, and it will be much easier to take apart in a C# app. Don't get fixated on the Word format; just sidestep the problem with creative thinking.
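As a rough illustration of taking the plain-text export apart in code, here is a sketch in Python rather than C#. The pattern for the code (par-, three digits, a dot, two digits) is inferred from the four samples in the question and may need adjusting; parse_paragraphs and the sample text are hypothetical. Unlike the find/replace route, it keeps multi-line content attached to its record.

```python
import re

# A line that starts a new record: code, " - ", then the first line of content.
RECORD_START = re.compile(r"^(par-\d{3}\.\d{2}) - (.*)$")

def parse_paragraphs(text):
    """Return a list of (code, content) tuples from the exported plain text."""
    records = []
    for line in text.splitlines():
        match = RECORD_START.match(line)
        if match:
            records.append([match.group(1), match.group(2)])
        elif records and line.strip():
            # Non-blank continuation line: append it to the current record.
            records[-1][1] += "\n" + line
    return [(code, content) for code, content in records]

sample = (
    "par-000.01 - some content\n"
    "that continues on a second line\n"
    "par-000.21 - some content\n"
)
for code, content in parse_paragraphs(sample):
    print(repr(code), repr(content))
```

Blank lines and any text before the first code are dropped in this sketch; the resulting tuples can be inserted into the code and text columns directly.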