xml ms-word converters sgml

Convert from XML to Microsoft Word Doc


I have a batch of XML and SGML documents (about 7,000 of them). I want something that will convert them into structured Microsoft Word documents. I've been reading online for two days about how to do this and am more confused than when I started.

I see you can use the Open XML SDK and C# in Visual Studio to create them, per this answer: StackOverflow answer, which links to Using XSLT and Open XML SDK. However, that material is from 7 years ago. I'm not sure whether it is still current, and I don't know whether it is what I need.

Also, the tags in the documents I'm converting from are in Swedish. So I'm guessing I'll need something that reads the tags and converts them to English, then turns the result into a Word XML format.

I can write C# and C++, and could probably find my way around most scripting languages if needed, but I have zero experience creating Word documents from code. I understand I might need to write a DTD or an XSLT, and possibly use Word XML (I've learnt about these in the past two days), and use that in some Visual Studio project.

However I have no idea how to actually go about this. Can someone please steer me in the right direction?

Thanks


Solution

  • This topic is very broad and can't really be answered in detail with a single post...

    The information you found, dated 7 years ago, is still pertinent and valid. All versions of Word since 2000 can work with the file format (2003 and earlier need the "Compatibility Pack", but most machines that have been updated will have that). Versions later than 2007 also read the format as it was documented 7 years ago, although that documentation does not cover functionality introduced in Word 2013/2016. Such functionality can be added without trouble; you just won't find those classes in the older documentation, but everything is on MSDN and in the current ECMA specifications.

    The tricky part, which isn't obvious at first glance, is that a Word Open XML document is actually a ZIP package containing multiple XML and binary files. An XSLT transform outputs a single XML document, so it cannot produce the package directly. Rather than relying only on the link in the SO Q&A you found, you might do better to work directly with the "Flat OPC" format, which represents the whole package as one XML file, as explained by Eric White: http://blogs.msdn.com/b/ericwhite/archive/2008/09/29/the-flat-opc-format.aspx.
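
    To make the "ZIP package" point concrete, here is a minimal sketch in Python (chosen only for brevity; the same parts are involved when working from C#). It assembles a three-part package of the kind a .docx file contains and lists the parts. The function names and the sample text are illustrative and are not part of the Open XML SDK.

```python
import io
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of every part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Package-level relationships: points at the main document part.
ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)

# The main document part: one paragraph with one run of text (sample text).
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Hello from a hand-built package</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def build_minimal_docx() -> bytes:
    """Return the bytes of a ZIP package holding the three parts above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", ROOT_RELS)
        zf.writestr("word/document.xml", DOCUMENT)
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list[str]:
    """List the entry names inside a .docx (which is just a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


if __name__ == "__main__":
    print(list_parts(build_minimal_docx()))
```

    Real documents carry many more parts (styles, numbering, settings, images), which is why a single XSLT output cannot stand in for the package itself.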

    Your XSLT should produce this format. The result then needs to be converted to a ZIP package before you can do any further work on it with the Open XML SDK (use version 2.5, not the 2.0 release from 7 years ago). The articles by Eric White describe how to do the conversion.
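
    The Flat OPC to ZIP step can be sketched as follows. This is a rough Python illustration of the idea, not the code from Eric White's articles: each `pkg:part` becomes one ZIP entry, and a `[Content_Types].xml` is generated from the `pkg:contentType` attributes. The function name and the sample input are made up for the example. One caveat I am assuming you can live with in a sketch: ElementTree re-serializes XML with its own namespace prefixes, which is namespace-equivalent but can break markup that refers to prefixes by name (for example `mc:Ignorable`), so a production converter would want a prefix-preserving serializer. The output has not been validated against Word.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

PKG = "{http://schemas.microsoft.com/office/2006/xmlPackage}"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the conventional w: prefix on output

# Illustrative Flat OPC input with two parts (made up for this example).
SAMPLE_FLAT_OPC = b"""<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
  <pkg:part pkg:name="/_rels/.rels"
            pkg:contentType="application/vnd.openxmlformats-package.relationships+xml">
    <pkg:xmlData>
      <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Id="rId1"
          Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
          Target="word/document.xml"/>
      </Relationships>
    </pkg:xmlData>
  </pkg:part>
  <pkg:part pkg:name="/word/document.xml"
            pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
    <pkg:xmlData>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>"""


def flat_opc_to_docx(flat_xml: bytes) -> bytes:
    """Convert a Flat OPC XML document into the bytes of a ZIP package."""
    root = ET.fromstring(flat_xml)
    parts = []  # (part name, content type, payload bytes)
    for part in root.findall(PKG + "part"):
        name = part.get(PKG + "name")
        ctype = part.get(PKG + "contentType")
        xml_data = part.find(PKG + "xmlData")
        bin_data = part.find(PKG + "binaryData")
        if xml_data is not None and len(xml_data):
            # XML parts: serialize the single child element as its own file.
            payload = ET.tostring(xml_data[0], encoding="utf-8", xml_declaration=True)
        elif bin_data is not None:
            # Binary parts (e.g. images) are stored base64-encoded.
            payload = base64.b64decode(bin_data.text or "")
        else:
            continue
        parts.append((name, ctype, payload))

    # One Override per part; simple, if more verbose than Word's own output.
    overrides = "".join(
        f"<Override PartName={quoteattr(n)} ContentType={quoteattr(c)}/>"
        for n, c, _ in parts
    )
    content_types = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        + overrides + "</Types>"
    )

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", content_types)
        for name, _, payload in parts:
            zf.writestr(name.lstrip("/"), payload)  # ZIP entries have no leading slash
    return buf.getvalue()


if __name__ == "__main__":
    with open("converted.docx", "wb") as fh:  # output path is arbitrary
        fh.write(flat_opc_to_docx(SAMPLE_FLAT_OPC))
```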

    The task will definitely not be trivial, as Word is a very complex beast. If sets of these documents have things in common, you might progress more quickly by manually "converting" (part of) one in the Word UI to the desired result. Save it and open it in the Open XML SDK Productivity Tool, where you can view the underlying Word Open XML (as well as the Open XML SDK code required to produce it). That should help you "map" the original mark-up to the Word Open XML mark-up.
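
    As a rough idea of what such a mapping looks like once you have worked it out, here is a small Python sketch. It uses plain tree-walking rather than XSLT, and the Swedish element names (`dokument`, `rubrik`, `stycke`) are hypothetical stand-ins, since the real tag names are not given in the question. It assumes a flat source structure with no nested mapped elements, and the "Heading1" style only takes effect if the package also contains a styles part that defines it.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

# HYPOTHETICAL mapping: source tag -> Word paragraph style id (None = default).
STYLE_FOR_TAG = {
    "rubrik": "Heading1",  # assumed to mean "heading"
    "stycke": None,        # assumed to mean "paragraph"
}

# Made-up source document for illustration.
SAMPLE_SOURCE = "<dokument><rubrik>Titel</rubrik><stycke>Första stycket.</stycke></dokument>"


def w(tag: str) -> str:
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def make_paragraph(text: str, style_id=None) -> ET.Element:
    """Build <w:p>, with an optional <w:pStyle>, holding one run of text."""
    p = ET.Element(w("p"))
    if style_id:
        ppr = ET.SubElement(p, w("pPr"))
        ET.SubElement(ppr, w("pStyle"), {w("val"): style_id})
    run = ET.SubElement(p, w("r"))
    ET.SubElement(run, w("t")).text = text
    return p


def source_to_document_xml(source_xml: str) -> bytes:
    """Turn mapped source elements into the content of word/document.xml."""
    src = ET.fromstring(source_xml)
    doc = ET.Element(w("document"))
    body = ET.SubElement(doc, w("body"))
    for el in src.iter():
        if el.tag in STYLE_FOR_TAG:  # unmapped tags are skipped in this sketch
            text = "".join(el.itertext()).strip()
            body.append(make_paragraph(text, STYLE_FOR_TAG[el.tag]))
    return ET.tostring(doc, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(source_to_document_xml(SAMPLE_SOURCE).decode("utf-8"))
```

    Note that the tags never need to be translated to English as a separate step; the mapping table goes straight from the source names to Word mark-up. SGML input would first need converting to well-formed XML, which this sketch does not attempt.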