
What is the best way to get clean, semantic XHTML from MS Word documents?


Some days ago I received a rather lengthy and somewhat elaborate MS Word document, which I was asked to convert to HTML for uploading to a 3rd party’s website. My first instinct was to save the Word document as HTML and use Dreamweaver’s "Clean Up Word HTML" command. Not only did I have to leave Dreamweaver running all night to finish "cleaning", but the results were far from desirable in my opinion. There were still a lot of leftover inline styles, etc., that Dreamweaver just plain missed.

I approached it differently this morning and just selected the entire document in Word, copied it, and then pasted it into Dreamweaver’s Design window. Not only was it much, much faster, but the output code was much, much cleaner! I didn’t have to run the "Clean Up Word HTML" command afterwards either.

Now I never convert a Word file straight to HTML, for standards reasons. Instead I cut and paste content between Word and Dreamweaver. Happily, I can do the following.

  1. If a Word heading is in the Heading 1 Style, it will become an H1 in Dreamweaver (following the Dreamweaver stylesheet). Similarly Heading 2 becomes H2, Heading 3 becomes H3 and so forth.

    If the Word author wasn't that organized, you can use a shortcut like Control+1 (or Command+1 on a Mac) to convert any line to an H1. Can you guess the shortcut for H2? Yes, it's Control+2 (or Command+2 on a Mac).

  2. Paragraphs now cut and paste as paragraphs (with the P tag). If you don't want an HTML paragraph at that point, use Control+0 (or Command+0 on a Mac) to remove it in Dreamweaver.

  3. A new one I discovered is that some embedded images in Word may be transferred to your Dreamweaver site as "clip" images when you copy and paste from Word. So, if you have a Word file with embedded images, you may be able to extract them fairly quickly via Dreamweaver.

I also found this free tool useful: http://www.textfixer.com/html/convert-word-to-html.php. It works much like Dreamweaver's Design view, which is useful for people who don't have Dreamweaver.

But doesn't the code we get depend on how well the MS Word document is formatted?

Does Word 2007 also have styles like HTML's?

Headings, tables, ordered and unordered lists, bold, italic, hyperlinks, etc.?

How can Word 2007 be used semantically?

  • To get the most semantic HTML possible from the "Save as HTML" option

  • To get the cleanest possible code when copying into Dreamweaver's Design view

  • To get the cleanest possible code when pasting into the browser-based
    WYSIWYG HTML editor that comes with every CMS

Does anyone know any tips, tricks, tutorials, articles or advice for formatting MS Word documents semantically?

Or is there a better way than mine?


Solution

    • HTML Tidy has options for this: word-2000, bare and clean.

    • FCKEditor and similar try to clean up code pasted from Word.

    • There's also the (rather old now) demoroniser.

    However, don't expect miracles. It's unlikely that a Word document will have decent structure (it theoretically could, but hardly any Word user bothers with this). These programs can't add semantic information if it's not there.
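
    To illustrate what this kind of cleanup amounts to, here is a minimal Python sketch that uses only the standard library. It is not how HTML Tidy, FCKEditor or demoroniser work internally; the names WordHtmlCleaner and clean_word_html, and the lists of tags and attributes to drop, are assumptions made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative lists only (assumptions): real Word output has more variety.
DROP_ATTRS = {"style", "class", "lang", "align"}
UNWRAP_TAGS = {"span", "font", "o:p"}  # the tag is dropped, its text is kept
VOID_TAGS = {"br", "hr", "img"}        # written as <br /> for XHTML


class WordHtmlCleaner(HTMLParser):
    """Re-emit a fragment without Word's presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in UNWRAP_TAGS:
            return
        attr_text = ""
        for name, value in attrs:
            if name not in DROP_ATTRS:
                attr_text += ' {}="{}"'.format(name, escape(value or "", quote=True))
        closer = " /" if tag in VOID_TAGS else ""
        self.parts.append("<{}{}{}>".format(tag, attr_text, closer))

    def handle_endtag(self, tag):
        if tag not in UNWRAP_TAGS and tag not in VOID_TAGS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.parts.append(escape(data, quote=False))

    # Comments (including Word's conditional comments) are silently dropped
    # because handle_comment is not overridden.


def clean_word_html(fragment):
    cleaner = WordHtmlCleaner()
    cleaner.feed(fragment)
    cleaner.close()
    return "".join(cleaner.parts)


dirty = "<p class=MsoNormal><b><span style='font-size:18.0pt'>Title</span></b></p>"
print(clean_word_html(dirty))
```

    Note what happens to the sample input: a paragraph that was only made to look like a heading (bold, large font) still comes out as a bold paragraph, not as a heading. The cleanup can strip noise, but it has no way of knowing the author meant a heading.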

    As for semantic editing in Word: use styles. Word supports headings properly (sadly not much else). You can check the resulting structure in Outline view.
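
    To see why styles matter, consider a rough Python sketch that reads the paragraph styles out of a Word 2007 .docx file (a zip archive containing word/document.xml) and maps heading styles to HTML headings. It assumes the built-in heading styles carry the style IDs "Heading1" to "Heading6", which is typical for English-language templates but not guaranteed; the function names are made up for this example, and everything other than headings and plain paragraphs is ignored.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def body_xml_to_html(document_xml):
    """Turn styled paragraphs into <h1>..<h6> and everything else into <p>."""
    root = ET.fromstring(document_xml)
    lines = []
    for para in root.iter(W + "p"):
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = (style.get(W + "val") or "") if style is not None else ""
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if not text.strip():
            continue  # skip empty paragraphs used as vertical spacing
        # Assumption: heading style IDs look like "Heading1" .. "Heading6".
        match = re.fullmatch(r"Heading([1-6])", style_id)
        tag = "h" + match.group(1) if match else "p"
        lines.append("<{0}>{1}</{0}>".format(tag, escape(text)))
    return "\n".join(lines)


def docx_to_html(path):
    """Hypothetical helper: read the main document part from a .docx file."""
    with zipfile.ZipFile(path) as archive:
        return body_xml_to_html(archive.read("word/document.xml"))
```

    A paragraph with the Heading 1 style becomes an h1 here, while a paragraph that was merely made bold and large stays a p, because the only structural information in the file is the style name.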

    You don't need, and shouldn't use, spaces or line breaks for indentation or spacing adjustments. Word can explicitly control each paragraph's indentation and spacing.