I'm parsing a PDF file using IText7 in C# that contains Japanese characters like so:
public static string ExtractTextFromPDF(string filePath)
{
var pdfReader = new PdfReader(filePath);
var pdfDoc = new PdfDocument(pdfReader);
var sb = new StringBuilder();
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
var strategy = new SimpleTextExtractionStrategy();
sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
}
pdfDoc.Close();
pdfReader.Close();
return sb.ToString();
}
But I run into the exception:
iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was not found.'
I've searched around for a solution on how to add this but I haven't come up with anything that works for the Japanese characters. If there is any other library more suited that would also be ok. Any help?
Thanks
Encoding CMaps in particular for CJK scripts are in a separate package.
For .Net use itext7.font-asian
via nuget.
For Java use com.itextpdf:font-asian
via maven.
The existence of this package is more visible for the Java version than for the .Net version.