Cyrillic characters garbled by pandoc when converting HTML to AsciiDoc


I have an HTML file written in Russian that I want to convert to an AsciiDoc (.adoc) file using pandoc.

<!DOCTYPE html
  SYSTEM "about:legacy-compat">
<html lang="ru-ru"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2021"><meta name="DC.rights.owner" content="(C) Copyright 2021"><meta name="DC.type" content="task"><meta name="DC.relation" scheme="URI" content="../topics/ManageEmployees.html"><meta name="prodname" content="Docsvision 5. Web-клиент"><meta name="prognum" content="5.5.16"><meta name="docver" content="1.0"><meta name="DC.format" content="HTML5"><meta name="DC.identifier" content="DeleteEmployee"><meta name="DC.language" content="ru-ru"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Удаление сотрудника</title></head><body id="DeleteEmployee"><main role="main"><article role="article" aria-labelledby="ariaid-title1">

    <h1 class="title topictitle1" id="ariaid-title1">Удаление сотрудника</h1>
    <div class="body taskbody">
        
        <section><div class="li stepsection"><p class="p">Для удаления ранее созданного сотрудника:</p></div><ol class="ol steps"><li class="li step">
                <span class="ph cmd">В правой области справочника выберите сотрудника, которого необходимо
                    удалить.</span>
            </li><li class="li step">
                <span class="ph cmd">Вызовите контекстное меню на выбранном сотруднике.</span>
            </li><li class="li step">
                <span class="ph cmd">Выберите в контекстном меню пункт  <span class="keyword parmname">Удалить</span>.</span>
            </li><li class="li step">
                <span class="ph cmd">Появится предупреждение, подтвердите действие кнопкой
                        <span class="ph uicontrol">ОК</span>.</span>
            </li></ol></section>
        <section class="section result" id="DeleteEmployee__result_lv3_2pt_y4b">
            <div class="note note note_note"><span class="note__title">Прим.:</span> Сотрудник будет полностью удалён из справочника.</div>
        </section>
    </div>
<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>На уровень выше:</strong> <a class="link" href="../topics/ManageEmployees.html">Работа с сотрудниками</a></div></div></nav></article></main></body></html>

I am using the following command:

pandoc --wrap=none -f html -t asciidoc .\topics\CreateDocumentCard.html > ..\output\file.adoc

The conversion runs and produces output, but only the Latin characters come through correctly. All the Cyrillic characters turn into gibberish. The output file and its preview in IntelliJ IDEA look like this:

[Screenshot: preview in IntelliJ IDEA]

You can see that Latin characters are processed normally.

I did some searching and found that some people run into similar issues with Cyrillic characters when producing PDF files. So I tried adding similar parameters to the command line, like this:

-V mainfont='My Font' -V lang -V babel-lang=russian

It didn't work, however.

I also tried the online version of pandoc with the same HTML source, and for some reason it converted just fine.

[Screenshot: pandoc online]

And I get the same result when converting from md to adoc.

I need Cyrillic characters to be properly displayed when converting HTML/MD to AsciiDoc with pandoc from the command prompt. How can I achieve that?


Solution

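  • Pandoc produces UTF-8 encoded output, but the Windows shell (Windows PowerShell, for example) writes redirected output as UTF-16 by default. The problem stems from using the > redirect to send the output to a file: the shell, not pandoc, creates the new file and re-encodes the text on the way, which is where the Cyrillic characters get garbled. The solution is therefore to let pandoc write the file itself with the -o file.adoc (or --output file.adoc) command line option, which ensures that the file is UTF-8 encoded. For the command in the question, that means replacing > ..\output\file.adoc with -o ..\output\file.adoc.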
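
    To see why only the Cyrillic text breaks, here is a minimal Python sketch of the effect. It is an illustration, not what the shell literally runs. It assumes the shell decodes pandoc's UTF-8 bytes with a legacy code page; cp866 is only an example, and the real code page depends on the system. The sniff_bom helper is a hypothetical name introduced here for checking how an existing output file was encoded.

```python
import codecs

# Text similar to the question's HTML: a Cyrillic word plus a Latin one.
text = "Удалить OK"

# Pandoc emits UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Assumption: the shell decodes those bytes with a legacy code page.
# cp866 is only an example; the actual code page depends on the system.
garbled = utf8_bytes.decode("cp866")

# ASCII bytes mean the same thing in both encodings, so Latin text survives,
# while each Cyrillic letter (two UTF-8 bytes) becomes two unrelated characters.
latin_survives = garbled.endswith(" OK")
cyrillic_broken = garbled != text

# Simulate the redirected file: the garbled text stored as UTF-16LE with a BOM.
redirected_file = codecs.BOM_UTF16_LE + garbled.encode("utf-16-le")


def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading byte order mark (UTF-32 is ignored)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "no BOM"
```

    Printing garbled shows the scrambled Cyrillic followed by an intact "OK", which matches the symptom in the question. Passing the first few bytes of the real file.adoc (read in binary mode) to sniff_bom is a quick way to check whether the redirect produced a UTF-16 file; a file written by pandoc via -o would normally report "no BOM".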