How can I extract text information such as positional coordinates, width, and height? I tried this with the PDF Clown library, and it works fine for normal text, but for rotated text (90/-90 degrees) it outputs a width/height of 0 (zero).
The scaling factors (scaleX, scaleY) for text rotated by 90/-90 degrees are reported as (0, 0), whereas for inverted text (rotated by 180 degrees) they are (-1, -1).
I need this information for rotated text in order to highlight it (as the width value is zero, I am unable to highlight it). Please help me. I'm working in a .NET environment.
File I am using: https://nofile.io/f/Kvf2DkXvfj4/edit9.pdf
Code: Using TextInfoExtractionSample.cs from the PDF Clown samples
Output (for the three differently oriented texts in the file above):
Text [x:283,y:104,w:126,h:-23] [font size:-24 , font sytle : ArialMT]: inverted_text
Text [x:265,y:244,w:0,h:121] [font size:0 , font sytle : ArialMT]: vertical_text
Text [x:347,y:131,w:0,h:167] [font size:0 , font sytle : ArialMT]: vertical_minus90
As I'm more at home with Java than .Net, I analyzed the problem and created a first workaround in PDF Clown / Java; I'll try and port it to .Net later. It shouldn't be too difficult, though, to do it yourself.
The sample file you provided makes the issue pretty clear when running it through the PDF Clown TextInfoExtractionSample.
[Screenshot of edit9.pdf]
[Screenshot of edit9.pdf after applying TextInfoExtractionSample]
For upright text, everything looks OK.
For the upside-down text, the individual character boxes (green) look OK, but the box for the whole string "inverted_text" (dashed black) excludes the outermost characters.
For vertical text, the individual character boxes are reduced to 0x0 rectangles (invisible in the screenshot but apparent in content stream analysis). The box for the whole string is reduced to a line (dashed black) on the baseline of the string, and it falls a bit short of the string's length.
For text at other angles, the character boxes are upright, parallel to the page borders, with their baseline segment inside the box. As the text is at an angle, though, the upper and lower parts of the characters are partially outside their respective character box, while neighboring characters are partially inside.
The boxes for the whole strings are also parallel to the page.
The text character and string boxes only work properly for upright text.
This matches what one finds in the source code:
The Java Rectangle2D and .Net RectangleF classes used for the character boxes by design are meant for rectangles parallel to the coordinate system axes and are used in that manner in PDF Clown. Thus, they cannot properly represent width and height of characters at arbitrary angles.
PDF Clown classes don't include an Angle attribute to represent the rotation of the character.
The calculation of the character box dimensions only takes the values on the main diagonal of the aggregated transformation matrix into account, i.e. ScaleX and ScaleY, and ignores ShearX and ShearY. For text which is not upright or upside down, though, ShearX and ShearY are important; for vertical text ScaleX and ScaleY are 0.
The transition from baseline (native PDF way of positioning text) to top-of-character (PDF Clown text positioning) is done by change of y coordinate alone and, therefore, only works properly for upright and upside down text.
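The effect of ignoring the off-diagonal entries can be reproduced outside PDF Clown. The following is only a minimal sketch using java.awt.geom.AffineTransform; the class RotatedWidthDemo and its methods are made up for illustration, and a pure rotation stands in for the real text rendering matrix.

```java
import java.awt.geom.AffineTransform;

// Hypothetical demo class, not part of PDF Clown.
class RotatedWidthDemo {
    /** Width derived from the main diagonal only (ScaleX), as in the original calculation. */
    static double diagonalOnlyWidth(AffineTransform trm, double charWidth) {
        return charWidth * trm.getScaleX();
    }

    /** Width as the length of the transformed vector (charWidth, 0), using ScaleX and ShearY. */
    static double fullWidth(AffineTransform trm, double charWidth) {
        double dx = charWidth * trm.getScaleX();
        double dy = charWidth * trm.getShearY();
        return Math.hypot(dx, dy);
    }

    /** Direction of the transformed x axis, in radians. */
    static double angle(AffineTransform trm) {
        return Math.atan2(trm.getShearY(), trm.getScaleX());
    }

    public static void main(String[] args) {
        double charWidth = 10; // arbitrary glyph width for the demonstration
        for (int degrees : new int[] {0, 90, 180, -90}) {
            AffineTransform trm = AffineTransform.getRotateInstance(Math.toRadians(degrees));
            System.out.printf("%4d deg: diagonal-only width = %6.2f, full width = %6.2f%n",
                degrees, diagonalOnlyWidth(trm, charWidth), fullWidth(trm, charWidth));
        }
    }
}
```

For the 90 and -90 degree cases the diagonal-only value collapses to zero, and for 180 degrees it becomes negative, which matches the widths and scaling factors reported in the question.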
A real fix of the issue would require using a completely different class for character and string boxes, a class that models rectangles at arbitrary angles.
A quicker work-around, though, can be to add an angle member to the TextChar class and to ITextString and implementations, and then to consider that angle when processing the boxes. This work-around is implemented here.
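To give an idea of what a class for rectangles at arbitrary angles might look like, here is a minimal sketch. RotatedBox is a hypothetical name and not part of PDF Clown; it stores an anchor point, the unrotated dimensions, and a rotation angle around the anchor, and derives the corners and an axis-parallel bounding box on demand.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDF Clown.
final class RotatedBox {
    final double x, y, width, height, angle;

    RotatedBox(double x, double y, double width, double height, double angle) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.angle = angle; // radians, rotation around the anchor (x, y)
    }

    /** The four corners, starting at the anchor and walking around the rectangle. */
    Point2D[] corners() {
        AffineTransform rot = AffineTransform.getRotateInstance(angle, x, y);
        Point2D[] pts = {
            new Point2D.Double(x, y),
            new Point2D.Double(x + width, y),
            new Point2D.Double(x + width, y + height),
            new Point2D.Double(x, y + height)
        };
        rot.transform(pts, 0, pts, 0, pts.length);
        return pts;
    }

    /** Smallest axis-parallel rectangle containing all four corners. */
    Rectangle2D bounds() {
        Point2D[] pts = corners();
        Rectangle2D result = new Rectangle2D.Double(pts[0].getX(), pts[0].getY(), 0, 0);
        for (Point2D p : pts) {
            result.add(p);
        }
        return result;
    }
}
```

A class along these lines keeps the true width and height regardless of orientation, whereas the work-around below keeps the existing rectangle classes and carries the angle separately.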
As already mentioned above, the work-around is first implemented in Java.
First we add an angle member to TextChar, calculate correct values for box dimensions and the angle in ShowText operation class, and correctly set these values in the ContentScanner.TextStringWrapper.
Then we add an angle getter to TextStringWrapper (and ITextString in general) which returns the angle of the first text char of the string. And we improve the TextStringWrapper method getBox to take the angle of the text chars into account when determining the string box.
Finally we'll extend the TextInfoExtractionSample to take the angle values into account when drawing the boxes.
I named that angle member Alpha as I named that angle α in my sketches. In hindsight, Theta or simply Angle would have been more appropriate.
New member variable alpha
private final double alpha;
A new and a changed constructor
// <constructors>
public TextChar(
  char value,
  Rectangle2D box,
  TextStyle style,
  boolean virtual
)
{
  this(value, box, 0, style, virtual);
}

public TextChar(
  char value,
  Rectangle2D box,
  double alpha,
  TextStyle style,
  boolean virtual
)
{
  this.value = value;
  this.box = box;
  this.alpha = alpha;
  this.style = style;
  this.virtual = virtual;
}
// </constructors>
A getter for the angle
public double getAlpha() {
  return alpha;
}
Update inner interface IScanner method scanChar to transport the angle
void scanChar(
  char textChar,
  Rectangle2D textCharBox,
  double alpha
);
(ShowText.java inner interface IScanner)
Update scan method to correctly calculate rectangle dimensions and angle and forward them to the IScanner implementation
[...]
for(char textChar : textString.toCharArray())
{
  double charWidth = font.getWidth(textChar) * scaledFactor;

  if(textScanner != null)
  {
    /*
      NOTE: The text rendering matrix is recomputed before each glyph is painted
      during a text-showing operation.
    */
    AffineTransform trm = (AffineTransform)ctm.clone(); trm.concatenate(tm);
    double charHeight = font.getHeight(textChar,fontSize);

    // vvv--- changed
    double ascent = font.getAscent(fontSize);
    // Box anchor: baseline origin shifted by the transformed ascent vector, with y flipped to top-down.
    double x = trm.getTranslateX() + ascent * trm.getShearX();
    double y = contextHeight - trm.getTranslateY() - ascent * trm.getScaleY();

    // Image of the vector (charWidth, 0): its direction is the angle, its length the width.
    double dx = charWidth * trm.getScaleX();
    double dy = charWidth * trm.getShearY();
    double alpha = Math.atan2(dy, dx);
    double w = Math.sqrt(dx*dx + dy*dy);

    // Image of the vector (0, charHeight): its length is the height.
    dx = charHeight * trm.getShearX();
    dy = charHeight * trm.getScaleY();
    double h = Math.sqrt(dx*dx + dy*dy);

    Rectangle2D charBox = new Rectangle2D.Double(x, y, w, h);
    textScanner.scanChar(textChar,charBox, alpha);
    // ^^^--- changed
  }

  /*
    NOTE: After the glyph is painted, the text matrix is updated
    according to the glyph displacement and any applicable spacing parameter.
  */
  tm.translate(charWidth + charSpace + (textChar == ' ' ? wordSpace : 0), 0);
}
[...]
Update TextStringWrapper constructor ShowText.IScanner callback to accept the angle argument and use it for constructing the TextChar
getBaseDataObject().scan(
  state,
  new ShowText.IScanner()
  {
    @Override
    public void scanChar(
      char textChar,
      Rectangle2D textCharBox,
      double alpha
    )
    {
      textChars.add(
        new TextChar(
          textChar,
          textCharBox,
          alpha,
          style,
          false
        )
      );
    }
  }
);
A getter for the angle
public double getAlpha() {
  return textChars.isEmpty() ? 0 : textChars.get(0).getAlpha();
}
A getBox implementation that takes the angle into account
public Rectangle2D getBox(
)
{
  if(box == null)
  {
    AffineTransform rot = null;
    Rectangle2D tempBox = null;
    for(TextChar textChar : textChars)
    {
      Rectangle2D thisBox = textChar.getBox();
      if (rot == null) {
        rot = AffineTransform.getRotateInstance(textChar.getAlpha(), thisBox.getX(), thisBox.getY());
        tempBox = (Rectangle2D)thisBox.clone();
      } else {
        Point2D corner = new Point2D.Double(thisBox.getX(), thisBox.getY());
        rot.transform(corner, corner);
        tempBox.add(new Rectangle2D.Double(corner.getX(), corner.getY(), thisBox.getWidth(), thisBox.getHeight()));
      }
    }

    if (tempBox != null) {
      try {
        Point2D corner = new Point2D.Double(tempBox.getX(), tempBox.getY());
        rot.invert();
        rot.transform(corner, corner);
        box = new Rectangle2D.Double(corner.getX(), corner.getY(), tempBox.getWidth(), tempBox.getHeight());
      } catch (NoninvertibleTransformException e) {
        e.printStackTrace();
      }
    }
  }
  return box;
}
(ContentScanner.java inner class TextStringWrapper)
New angle getter
public double getAlpha();
New angle getter
public double getAlpha() {
  return textChars.isEmpty() ? 0 : textChars.get(0).getAlpha();
}
Changes to extract to properly use the angle in outlining the boxes
[...]
for (ContentScanner.TextStringWrapper textString : text.getTextStrings())
{
  Rectangle2D textStringBox = textString.getBox();
  System.out.println("Text [" + "x:" + Math.round(textStringBox.getX()) + "," + "y:" + Math.round(textStringBox.getY()) + "," + "w:"
      + Math.round(textStringBox.getWidth()) + "," + "h:" + Math.round(textStringBox.getHeight()) + "] [font size:"
      + Math.round(textString.getStyle().getFontSize()) + "]: " + textString.getText());

  // Drawing text character bounding boxes...
  colorIndex = (colorIndex + 1) % textCharBoxColors.length;
  composer.setStrokeColor(textCharBoxColors[colorIndex]);
  for (TextChar textChar : textString.getTextChars())
  {
    // vvv--- changed
    Rectangle2D box = textChar.getBox();
    composer.beginLocalState();
    AffineTransform rot = AffineTransform.getRotateInstance(textChar.getAlpha());
    composer.applyMatrix(rot.getScaleX(), rot.getShearY(), rot.getShearX(), rot.getScaleY(),
        box.getX(), composer.getScanner().getContextSize().getHeight() - box.getY());
    composer.add(new DrawRectangle(0, - box.getHeight(), box.getWidth(), box.getHeight()));
    composer.stroke();
    composer.end();
    // ^^^--- changed
  }

  // Drawing text string bounding box...
  composer.beginLocalState();
  composer.setLineDash(new LineDash(new double[] { 5 }));
  composer.setStrokeColor(textStringBoxColor);
  // vvv--- changed
  AffineTransform rot = AffineTransform.getRotateInstance(textString.getAlpha());
  composer.applyMatrix(rot.getScaleX(), rot.getShearY(), rot.getShearX(), rot.getScaleY(),
      textStringBox.getX(), composer.getScanner().getContextSize().getHeight() - textStringBox.getY());
  composer.add(new DrawRectangle(0, - textStringBox.getHeight(), textStringBox.getWidth(), textStringBox.getHeight()));
  // ^^^--- changed
  composer.stroke();
  composer.end();
}
[...]
(TextInfoExtractionSample method extract)
Both character boxes and string boxes are now as intended, and the width and height outputs are correct as well:
Text [x:415,y:104,w:138,h:23] [font size:-24]: inverted_text
Text [x:247,y:365,w:128,h:23] [font size:0]: vertical_text
Text [x:364,y:131,w:180,h:23] [font size:0]: vertical_minus90
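For highlighting, which is what the question is ultimately after, the box and the angle have to be combined into four corner points. The following is a sketch only: HighlightQuad and quad are hypothetical names, the method mirrors the matrix and rectangle used in the modified extract method above (anchor at the box origin with flipped y, rectangle extending by the height in the local negative y direction), and the angle and page height used in main are assumed values for illustration, not read from the sample file.

```java
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDF Clown.
class HighlightQuad {
    /**
     * Corners of a rotated text box in bottom-up page coordinates, in the order
     * top-left, top-right, bottom-right, bottom-left as seen along the text direction.
     */
    static Point2D[] quad(double boxX, double boxY, double w, double h,
                          double alpha, double pageHeight) {
        double cos = Math.cos(alpha);
        double sin = Math.sin(alpha);
        // Anchor of the box, converted from top-down to bottom-up y.
        double ex = boxX;
        double ey = pageHeight - boxY;
        return new Point2D[] {
            new Point2D.Double(ex, ey),
            new Point2D.Double(ex + w * cos, ey + w * sin),
            new Point2D.Double(ex + w * cos + h * sin, ey + w * sin - h * cos),
            new Point2D.Double(ex + h * sin, ey - h * cos)
        };
    }

    public static void main(String[] args) {
        // Box values taken from the "vertical_text" output line above.
        // Assumptions: an angle of 90 degrees and a page height of 842 points.
        Point2D[] corners = quad(247, 365, 128, 23, Math.toRadians(90), 842);
        for (Point2D corner : corners) {
            System.out.printf("(%.1f, %.1f)%n", corner.getX(), corner.getY());
        }
    }
}
```

The resulting points can then be handed to whatever drawing or annotation code is used for the highlight; which order that code expects the corners in should be checked against its documentation.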