I'm using iText 7 to extract text from PDFs, and superscript/subscript characters are regularly showing up on the line above or below.
I've tracked this down to the SameLine method of TextChunkLocation, and I'm creating a custom version of this class so I can tweak the logic (along with dealing with some other things as well, such as automatically truncating headers and footers). But I'm a little stymied by the last line of this method in the default implementation:
    public virtual bool SameLine(ITextChunkLocation @as) {
        if (OrientationMagnitude() != @as.OrientationMagnitude()) {
            return false;
        }
        float distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
        if (distPerpendicularDiff == 0) {
            return true;
        }
        LineSegment mySegment = new LineSegment(startLocation, endLocation);
        LineSegment otherSegment = new LineSegment(@as.GetStartLocation(), @as.GetEndLocation());
        return Math.Abs(distPerpendicularDiff) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION
            && (mySegment.GetLength() == 0 || otherSegment.GetLength() == 0);
    }
I understand the comparison to DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION. I don't understand why at least one of the two line segments must have a length of 0 for them to be "on the same line."
If a text chunk has no diagonal line length, wouldn't that mean the text chunk is empty, and thus it's a moot point to wonder if it is or isn't on the same line?
How would this logic ever return true for diacritical marks... or for any other situation where two text chunks should be on the same line but are slightly misaligned?
Depending on the font and the diacritical mark in question, the font may not contain every needed combination of character and mark as a single glyph. Instead there may merely be an individual glyph for the mark in question, and a character with that mark is drawn by drawing the character glyph and the mark glyph at the same position.
To allow them to be drawn at the same position, the glyph of the mark, albeit not actually having a zero width, when drawn does not advance the glyph drawing position.
Furthermore, a diacritical mark may have to be drawn at different heights depending on the character it is combined with, in particular if the character is combined with multiple marks.
In the context of your question, therefore, ...
> If a text chunk has no diagonal line length, wouldn't that mean the text chunk is empty, and thus it's a moot point to wonder if it is or isn't on the same line?
If a diacritical mark glyph is combined with a character glyph at a different height than normal, that mark glyph forms a text chunk by itself with a DistPerpendicular value differing from the values of chunks around it on the same line.
As the mark glyph doesn't advance the text insertion point, and the length of the chunk baseline is essentially the sum of the character advancements of the drawn glyphs, the length of a chunk containing only a diacritical mark is 0. So a zero-length chunk is not necessarily an empty one.
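To make the zero-length case concrete, here is a minimal sketch that models a chunk's baseline length as the sum of its glyph advances. BaselineModel and BaselineLength are hypothetical names for illustration only; they are not iText types, and the advance values are made up.

```csharp
using System.Linq;

// Hypothetical model, not part of iText: a chunk is reduced to the
// horizontal advances of the glyphs it draws.
public static class BaselineModel
{
    // The baseline length is modeled as the sum of the glyph advances.
    public static float BaselineLength(params float[] glyphAdvances)
    {
        return glyphAdvances.Sum();
    }
}
```

Under this model, BaselineLength(5.0f, 4.5f) is 9.5 for a chunk with two ordinary glyphs, while BaselineLength(0f) is 0 for a chunk that contains only a non-advancing mark glyph, even though that chunk still draws something visible.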
To recognize such a situation, therefore, iText checks whether the DistPerpendicular difference, albeit not 0, is not too large, and whether one of these chunks in question has zero length:
    return Math.Abs(distPerpendicularDiff) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION
        && (mySegment.GetLength() == 0 || otherSegment.GetLength() == 0);
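The following sketch replays that decision with plain numbers instead of iText objects, which may help when experimenting with a custom implementation. SameLineModel is a hypothetical class, the tolerance of 2 is an assumption made for illustration (check the constant in the iText version you use), and the parameters simply stand in for OrientationMagnitude(), DistPerpendicular() and the segment length of each chunk.

```csharp
using System;

// Hypothetical, iText-free model of the check quoted above.
public static class SameLineModel
{
    // Assumed tolerance for illustration only; verify the real constant.
    public const float AllowedVerticalDeviation = 2f;

    public static bool SameLine(
        int orientationA, float distPerpA, float lengthA,
        int orientationB, float distPerpB, float lengthB)
    {
        // Chunks with different orientations are never on the same line.
        if (orientationA != orientationB)
        {
            return false;
        }
        // Identical perpendicular distance: same line, whatever the lengths.
        float diff = distPerpA - distPerpB;
        if (diff == 0)
        {
            return true;
        }
        // Small vertical offset AND one zero-length chunk: treated as a
        // diacritical mark belonging to the neighbouring line.
        return Math.Abs(diff) <= AllowedVerticalDeviation
            && (lengthA == 0 || lengthB == 0);
    }
}
```

With these assumptions, SameLine(0, 101.5f, 0f, 0, 100f, 30f) returns true because the first chunk has zero length, whereas SameLine(0, 101.5f, 12f, 0, 100f, 30f) returns false. The second case resembles a superscript or subscript: it is slightly offset but does advance the insertion point, so the default check puts it on a separate line. A custom class would need a different condition for that case.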