c++ pdf fonts podofo

Incorrect character displacement obtained using PoDoFo


I'm using PoDoFo to extract character displacement to update a text matrix correctly. This is a code fragment of mine:

PdfString str, ucode_str;
std::stack<PdfVariant> *stack;
const PdfFontMetrics *f_metrics;
...

/* Convert string to UTF8 */
str = stack->top().GetString();
ucode_str = ts->font->GetEncoding()->ConvertToUnicode(str, ts->font);
stack->pop();
c_str = (char *) ucode_str.GetStringUtf8().c_str();

/* Font metrics to obtain a character displacement */
f_metrics = ts->font->GetFontMetrics();

for (j = 0; j < strlen(c_str); j++) {
    str_w = f_metrics->CharWidth(c_str[j]);

    /* Adjust text matrix using str_w */
    ...
}

It works well for some PDF files (str_w contains a useful width), but doesn't work for others. In these cases str_w contains 0.0. I took a look at the PoDoFo 0.9.5 sources and found CharWidth() implemented for all sub-classes of PdfFontMetrics.

Am I missing something important during this string conversion?

Update from 04.08.2017

@mkl did a really good job reviewing PoDoFo's code. However, I realized that I actually needed a slightly different value. To be precise, I needed the glyph width expressed in text space units (see PDF Reference 1.7, 5.1.3 Glyph Positioning and Metrics), but CharWidth() is implemented in PdfFontMetricsObject.cpp like this:

double PdfFontMetricsObject::CharWidth(unsigned char c) const
{
    if (c >= m_nFirst && c <= m_nLast &&
        c - m_nFirst < static_cast<int>(m_width.GetSize())) {
        double dWidth = m_width[c - m_nFirst].GetReal();

        return (dWidth * m_matrix.front().GetReal() * this->GetFontSize() + this->GetFontCharSpace()) * this->GetFontScale() / 100.0;
    }

    if (m_missingWidth != NULL)
        return m_missingWidth->GetReal();
    else
        return m_dDefWidth;
}

The width is scaled by additional factors (font size, font scale) and offset by the character spacing. What I really needed was only dWidth * m_matrix.front().GetReal(). Thus, I decided to implement GetGlyphWidth(int c) in the same file like this:

double PdfFontMetricsObject::GetGlyphWidth(int c) const
{
    if (c >= m_nFirst && c <= m_nLast &&
        c - m_nFirst < static_cast<int>(m_width.GetSize())) {
        double dWidth = m_width[c - m_nFirst].GetReal();
        return dWidth * m_matrix.front().GetReal();
    }
    return 0.0;
}

and call this one instead of CharWidth() from the first listing.


Solution

  • If I understand the PoDoFo code correctly (I'm not really a PoDoFo expert...), the PdfFontMetricsObject class is used to represent the metrics of fonts contained in an already existing PDF:

    /** Create a font metrics object based on an existing PdfObject
     *
     *  \param pObject an existing font descriptor object
     *  \param pEncoding a PdfEncoding which will NOT be owned by PdfFontMetricsObject
     */
    PdfFontMetricsObject( PdfObject* pFont, PdfObject* pDescriptor, const PdfEncoding* const pEncoding );
    

    The method CharWidth here is implemented like this:

    double PdfFontMetricsObject::CharWidth( unsigned char c ) const
    {
        if( c >= m_nFirst && c <= m_nLast
            && c - m_nFirst < static_cast<int>(m_width.GetSize()) )
        {
            double dWidth = m_width[c - m_nFirst].GetReal();
    
            return (dWidth * m_matrix.front().GetReal() * this->GetFontSize() + this->GetFontCharSpace()) * this->GetFontScale() / 100.0;
        }
    
        if( m_missingWidth != NULL )
            return m_missingWidth->GetReal ();
        else
            return m_dDefWidth;
    }
    

    In particular, one sees that the parameter c is not mapped through the font encoding but used as is for the lookup in the widths array. Thus, the expected input of this method does not appear to be an ASCII or ANSI character code but the original glyph ID.

    Your code, on the other hand, has already transformed the glyph IDs to Unicode in UTF-8 and, therefore, essentially tries to look up widths by ANSI character codes.


    This would match the example documents. A typical font encoding in the PDF processed with errors looks like this:

    28 0 obj
    <<
      /Differences[0/B/G/W/a/d/e/f/g  9/i/l/n/o/p/r/space/t/w]
      /BaseEncoding/MacRomanEncoding
      /Type/Encoding
    >>
    endobj
    

    with glyph codes from 0 (FirstChar) to 17 (LastChar), or

    12 0 obj
    <<
      /Differences[1/A/B/C/D/F/I/L/M/N/O/P/R/T/U/a/c/d
                    /degree/e/eight/f/five/four/g/h
                   27/i/l/m/n/o/one/p/parenleft/parenright
                    /period/r/registered/s/space
                    /t/three/two/u/w/zero]
      /BaseEncoding/MacRomanEncoding
      /Type/Encoding
    >>
    endobj 
    

    with glyph codes from 1 (FirstChar) to 46 (LastChar).

    So these encodings assign glyph codes starting from 0 (or 1) to just the required glyphs and don't cover many glyphs.

    Thus, CharWidth will return 0 for all char values above 17 or above 46, respectively, which means all (in the former case) or most (in the latter case) ANSI non-control characters.

    On the other hand a typical font encoding in the PDF processed correctly looks like this:

    1511 0 obj
    <<
      /Type/Encoding
      /BaseEncoding/WinAnsiEncoding
      /Differences[
        1/Delta/Theta
        8/Phi
        11/ff/fi/fl/ffi
        39/quoteright
      ]
    >>
    endobj 
    

    with glyph codes from 1 (FirstChar) to 122 (LastChar).

    These encodings basically are WinAnsiEncoding with minor additions in the lower values, in particular the control character values.


    What you can do, therefore, is iterate over the glyph codes in str (allowing you to call CharWidth for them) and convert them individually to Unicode when needed, instead of first converting str to Unicode ucode_str and then iterating over ANSI characters in ucode_str.
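
The idea can be sketched without PoDoFo as follows. This is a simplified model under stated assumptions, not the library's API: SimpleFont, widthForCode and measureAndDecode are hypothetical names, the code-to-character map is restricted to single chars, and 0.0 stands in for the default width. In PoDoFo 0.9.5 the raw bytes should be reachable through PdfString::GetString() and PdfString::GetLength(), but verify that against your version.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a simple font read from a PDF; not a PoDoFo type.
struct SimpleFont {
    int firstChar;                         // /FirstChar
    std::vector<double> widths;            // /Widths, indexed by (code - firstChar)
    std::map<unsigned char, char> toChar;  // code -> character, e.g. built from /Differences
};

// Width lookup keyed by the raw character code from the content stream,
// mirroring the range check in CharWidth(); 0.0 stands in for the default width.
double widthForCode(const SimpleFont& font, unsigned char code) {
    int index = code - font.firstChar;
    if (index >= 0 && index < static_cast<int>(font.widths.size()))
        return font.widths[index];
    return 0.0;
}

struct Measured {
    std::string text;  // decoded characters
    double width;      // sum of the per-code widths
};

// Walk the raw codes once: measure each code, then decode that single code.
Measured measureAndDecode(const SimpleFont& font, const std::string& rawCodes) {
    Measured out{std::string(), 0.0};
    for (unsigned char code : rawCodes) {
        out.width += widthForCode(font, code);
        auto it = font.toChar.find(code);
        out.text += (it != font.toChar.end()) ? it->second : '?';
    }
    return out;
}
```

With the first example encoding above (code 3 is /a, code 4 is /d) and made-up widths, the raw bytes 3, 4, 4 decode to "add" and each code finds its own width. Calling widthForCode(font, 'a') instead passes code 97, which lies outside the 18-entry table and yields 0.0, which is the symptom described in the question.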