
Avoid unwanted LF linefeeds in PanDoc docx to markdown conversion


I need to convert a PDF document to markdown. Because PanDoc doesn't support PDF as an input format, I use MS Word's online conversion. The result in Word looks like this:

[Image: PDF converted by MS Word]

As you can see, there's a tab character after (1), and a paragraph character at the end.

I then use PanDoc to convert this to markdown, with this command:

pandoc -s nis2.docx -wrap=none -t markdown -o nis2.md

The resulting markdown file looks like this in VS Code:

[Image: markdown file in VS Code]

I'm using the code-eol extension to show LF characters, which are displayed as downward arrows.

It appears that either PanDoc or VS Code has added LF characters at the end of each line to create wrapping, and 4 spaces at the beginning of each continuation line to create indents. I've tried both the -t markdown and -t gfm output flags; the result is the same for both.

What I need to achieve is a single long line, in this example starting with (1) and ending with "society.", terminated with an LF and without extra spaces.

Any suggestions?


Solution

  • This is a bit of silly trickery, but you can set the line length to something ridiculous so that a whole paragraph always fits on one line. I set up an example similar to yours, called test.docx, and ran:

    pandoc test.docx -o test.md --columns=3000
    

    However, I think your wrap option is missing a dash. When I run:

    pandoc test.docx -o test.md --wrap=none
    

    it also gives me your desired result.

    When I run it with your -wrap=none, it gives me an error, so I suspect the command you posted here is faulty anyway.

    And hopefully the last edit: what happens if you open the resulting markdown file in something like vim? For me, --wrap=auto clearly wraps the text across multiple lines, but --wrap=none puts each paragraph on one line. So maybe VS Code is wrapping the text implicitly?

    ❱ pandoc --version
    pandoc 3.1.5
    Features: +server +lua
    Scripting engine: Lua 5.4
    ...
    Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
    This is free software; see the source for copying conditions. There is no
    warranty, not even for merchantability or fitness for a particular purpose.
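
To tell whether the line breaks and leading spaces are really in the file, rather than being added by an editor's soft wrapping, it can help to inspect the file directly instead of trusting how it is displayed. The following is only a rough sketch: the two helper functions are hypothetical names made up for this example (they are not part of pandoc), and nis2.md is the output file from the question.

```shell
# Hypothetical helpers (not part of pandoc) for inspecting the converted
# file without relying on how an editor displays it.

# Print the length of the longest physical line in the file given as $1.
# If every paragraph sits on one line, this should be much larger than
# pandoc's default wrap width.
longest_line() {
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# Print how many physical lines in $1 start with four spaces
# (the indentation described in the question).
indented_lines() {
  awk '/^    / { n++ } END { print n + 0 }' "$1"
}

# Example usage, assuming nis2.md was produced by the conversion:
#   longest_line nis2.md
#   indented_lines nis2.md
```

If longest_line reports a small number and indented_lines is greater than zero, the wrapping is in the file itself and the pandoc options are the place to look. If the longest line is as long as a whole paragraph, the file is fine and the wrapping seen on screen is most likely coming from the editor.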