c# pdf unicode itext itext7

How do I read Japanese characters from a PDF?


I'm using iText 7 in C# to parse a PDF file that contains Japanese characters, like so:

    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    public static string ExtractTextFromPDF(string filePath)
    {
        var pdfReader = new PdfReader(filePath);
        var pdfDoc = new PdfDocument(pdfReader);
        var sb = new StringBuilder();
        // iText page numbers are 1-based.
        for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
        {
            // Use a fresh strategy for each page.
            var strategy = new SimpleTextExtractionStrategy();
            sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
        }
        pdfDoc.Close();
        pdfReader.Close();
        return sb.ToString();
    }

But I run into the exception:

iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was not found.'

I've searched for a way to add this CMap, but I haven't found anything that works for the Japanese characters. If another library is better suited to this, that would also be fine. Any help?

Thanks


Solution

  • The encoding CMaps, in particular those for CJK scripts, ship in a separate package.

    For .NET, install itext7.font-asian via NuGet.

    For Java, add com.itextpdf:font-asian via Maven.

    The existence of this package is more visible for the Java version than for the .NET version.
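
    On the .NET side, one way to add the package is the dotnet CLI. This is a minimal sketch, assuming the .NET SDK is installed and the command is run from the directory that contains the project file. The package name is the one given above; no version is pinned here, although it is generally advisable to keep it on the same version as the main itext7 package.

    ```shell
    # Assumption: run from the folder containing the .csproj that already references itext7.
    # Adds the package holding the CJK CMaps (such as UniJIS-UTF16-H) to the project.
    dotnet add package itext7.font-asian
    ```

    Once the package is referenced, the extraction code from the question should not need any changes, since the CMap resources are looked up at run time.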