I need to convert a PDF document to markdown. Because PanDoc doesn't support PDF as an input format, I use MS Word's online conversion. The result in Word looks like this:
As you can see, there's a tab character after (1), and a paragraph character at the end.
I then use PanDoc to convert this to markdown, with this command:
pandoc -s nis2.docx -wrap=none -t markdown -o nis2.md
The resulting markdown file looks like this in VS Code:
I'm using the code-eol extension to show LF characters, which are displayed as downward arrows.
It appears that either PanDoc or VS Code has added LF characters at the end of each line to create wrapping, and 4 spaces at the beginning to create indents. I've tried both the -t markdown and -t gfm output flags; the result is the same for both.
What I need is a single long line (in this example, starting with (1) and ending with society.), terminated with an LF and without extra spaces.
Any suggestions?
This is a bit of silly trickery, but you can set the line length to something ridiculously large so that a whole paragraph always fits on one line. I set up an example similar to yours called test.docx and ran:
pandoc test.docx -o test.md --columns=3000
However, I think your wrap option is missing a dash. When I run:
pandoc test.docx -o test.md --wrap=none
it also gives me your desired result.
When I run it with your -wrap=none, it gives me an error, so I suspect the command you posted here is faulty anyway.
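For reference, here is your original command with only the missing dash added, wrapped in a small shell function so it is easy to reuse. This is a minimal sketch: docx_to_md is a name I made up, and the flags are exactly the ones from your question apart from --wrap=none.

```shell
# Hypothetical helper: convert a .docx file to markdown without hard wrapping.
# $1 = input .docx, $2 = output .md
docx_to_md() {
  # Note the double dash in --wrap=none; the other flags are unchanged
  # from the command in the question.
  pandoc -s "$1" --wrap=none -t markdown -o "$2"
}
```

Usage would be something like docx_to_md nis2.docx nis2.md.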
One last edit, hopefully: what happens if you open the resulting markdown file in something like vim? For me, --wrap=auto clearly wraps the text across multiple lines, but --wrap=none puts each paragraph on one line. So maybe VS Code is wrapping the text implicitly?
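If you want to take the editor out of the picture entirely, you could inspect the file itself rather than how it is displayed. Below is a rough sketch (line_stats is just a name I picked) that prints the number of lines and the length of the longest line. Pandoc's default line length is 72 columns, so a longest line at or below roughly 72 would suggest the file really is hard-wrapped, while a much larger number would point to the editor doing soft wrapping.

```shell
# Hypothetical helper: print the line count and the longest line length of a file.
# $1 = path to the markdown file
line_stats() {
  awk '
    { if (length($0) > max) max = length($0) }   # track the longest line seen so far
    END { printf "lines=%d longest=%d\n", NR, max }
  ' "$1"
}
```

Running line_stats nis2.md on both the --wrap=auto and --wrap=none outputs should make the difference obvious.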
❱ pandoc --version
pandoc 3.1.5
Features: +server +lua
Scripting engine: Lua 5.4
...
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.