c# pdf search itext

Searching PDF for Underlined and Bolded text


Using iTextSharp, how can I determine if a parsed chunk of text is both bolded and underlined?

Details:
I'm trying to parse PDF files in C#, specifically for text that is both bolded and underlined. Using iTextSharp, I can derive from LocationTextExtractionStrategy and get the text, the location, the font, etc. from the iTextSharp.text.pdf.parser.TextRenderInfo object passed to the overridden RenderText method.
However, determining whether the text is bold and/or underlined from the TextRenderInfo object has not been straightforward.

  • I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful
  • I can currently determine whether the text is bold by accessing the private graphics state field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold" (ugly, but it appears to work).
  • Biggest issue: I haven't found anything to determine if the text is underlined. How can I determine this?

Here is my current attempt:

        // Requires: using System.Reflection; using iTextSharp.text.pdf.parser;
        private FieldInfo _gsField = typeof(TextRenderInfo).GetField("gs",
            BindingFlags.GetField | BindingFlags.NonPublic | BindingFlags.Instance);

        // Called automatically for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);
            //UNDONE: Need to determine if text is underlined.  How?

            //NOTE: renderInfo.GetFont().FontWeight does not contain any actual information
            var gs = (GraphicsState)_gsField.GetValue(renderInfo);
            var textChunkInfo = new TextChunkInfo(renderInfo);
            _allLocations.Add(textChunkInfo);
            if (gs.Font.PostscriptFontName.Contains("Bold"))
            {
                // Add this to our found collection
                FoundItems.Add(new TextChunkInfo(renderInfo));
            }

            if (!_lineHeights.Contains(textChunkInfo.LineHeight))
                _lineHeights.Add(textChunkInfo.LineHeight);
        }

The full source code of my current attempt is in a GitHub repository. Two example files (example.pdf and example2.pdf) are included, with text similar to what I'll be searching through.


Solution

    • I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful

    • I can currently determine whether the text is bold by accessing the private graphics state field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold" (ugly, but it appears to work).

    I don't quite understand this differentiation. TextRenderInfo.GetFont() is exactly the same as the Font property of the private Graphics State field of TextRenderInfo.

    That being said, though, this is indeed one of the major ways to determine boldness.

    Bold writing in PDFs is achieved either using

    • explicitly bold fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold by

      • looking at the font name: it may contain a substring "bold" or something similar;

      • looking at some optional properties of the font, e.g. font weight, but beware, they are optional...

      • inspecting the embedded font file if applicable.

      None of these methods is foolproof;

    • the same font as for non-bold text, but with special techniques applied to make it appear bold (aka poor man's bold), e.g.

      • not only filling the glyph contours but also drawing a thicker line along it for a bold impression,

      • drawing the glyph twice, the second time slightly displaced, also for a bold impression.
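    The font-name check from the first bullet can be sketched as a small helper. This is only an illustration under assumptions: the class name, the method name, and the marker list are made up here, and the six-letter "ABCDEF+" prefix handling assumes the usual naming of subset fonts. Like the name check itself, it will miss poor man's bold and can misfire on unusual font names.

```csharp
using System;

public static class BoldHeuristics
{
    // Returns true if the PostScript font name suggests a bold face.
    // Subset fonts usually carry a six-letter tag such as "ABCDEF+",
    // which is stripped first so the tag itself cannot match a marker.
    public static bool NameSuggestsBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName))
            return false;

        int plus = postscriptFontName.IndexOf('+');
        string name = plus == 6 ? postscriptFontName.Substring(7) : postscriptFontName;

        // Hypothetical marker list; extend it for the fonts you actually encounter.
        string[] markers = { "bold", "black", "heavy" };
        foreach (string marker in markers)
        {
            if (name.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
```

    In the RenderText override from the question, this would be called with the PostscriptFontName of the font in place of the plain Contains("Bold") test.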

    Underlined writing in PDFs is usually achieved by explicitly drawing a line or a very thin rectangle under the text. You can try to detect such lines by implementing IExtRenderListener, parsing the page in question with it to determine the line locations, and then matching those with text positions during text extraction. Both can also be done in a single pass, but beware: the underlines need not be drawn before the text or even shortly thereafter; the PDF producer may first draw all text and only then draw all underlines. Furthermore, I've also come across a funny construction: very short (e.g. 1pt) but very wide (e.g. 50pt) vertical lines, which are effectively seen as horizontal ones...
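    The matching step could look roughly like the following. It is a sketch with made-up types (LineSegment, TextBox) and made-up thresholds, not anything iTextSharp provides. It assumes unrotated text, assumes line and text coordinates are already in the same coordinate system, and ignores both thin rectangles and the short-but-wide vertical strokes mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value types; in practice they would be filled from the
// path and text callbacks during parsing.
public struct LineSegment
{
    public float X1, Y1, X2, Y2;
    public LineSegment(float x1, float y1, float x2, float y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }
}

public struct TextBox
{
    public float Left, Right, Baseline, FontSize;
    public TextBox(float left, float right, float baseline, float fontSize)
    {
        Left = left; Right = right; Baseline = baseline; FontSize = fontSize;
    }
}

public static class UnderlineMatcher
{
    // True if some line lies just below the text baseline and spans most of its width.
    public static bool IsUnderlined(TextBox text, IEnumerable<LineSegment> lines)
    {
        foreach (LineSegment line in lines)
        {
            // Only (nearly) horizontal segments qualify.
            if (Math.Abs(line.Y1 - line.Y2) > 0.5f)
                continue;

            // The line must sit at or slightly below the baseline
            // (0.3 * font size is an arbitrary tolerance).
            float gap = text.Baseline - (line.Y1 + line.Y2) / 2;
            if (gap < 0 || gap > 0.3f * text.FontSize)
                continue;

            // Require the line to cover at least 80% of the text width.
            float overlap = Math.Min(Math.Max(line.X1, line.X2), text.Right)
                          - Math.Max(Math.Min(line.X1, line.X2), text.Left);
            if (overlap >= 0.8f * (text.Right - text.Left))
                return true;
        }
        return false;
    }
}
```

    Because underlines may be drawn long after the text, it is simplest to collect all lines of a page first and run this check once the page has been fully parsed.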

    IExtRenderListener extends the IRenderListener with three new methods, ModifyPath, RenderPath, and ClipPath. Whenever some path is drawn, be it a single line, a rectangle, or some very complex path, you'll first get a number of ModifyPath calls (at least one)

    /**
     * Called when the current path is being modified. E.g. new segment is being added,
     * new subpath is being started etc.
     *
     * @param renderInfo Contains information about the path segment being added to the current path.
     */
    void ModifyPath(PathConstructionRenderInfo renderInfo); 
    

    defining the lines and curves the path consists of, then at most one ClipPath call

    /**
     * Called when the current path should be set as a new clipping path.
     *
     * @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
     */
    void ClipPath(int rule);
    

    (if and only if the path shall serve as clip path for the following drawing operations), and finally exactly one RenderPath call

    /**
     * Called when the current path should be rendered.
     *
     * @param renderInfo Contains information about the current path which should be rendered.
     * @return The path which can be used as a new clipping path.
     */
    Path RenderPath(PathPaintingRenderInfo renderInfo);
    

    defining how that path shall be drawn (any combination of filling its interior and stroking the path itself).

    I.e. for recognizing underlines, you'll have to collect the path pieces provided via ModifyPath and decide whether they might describe one or more underlines as soon as the RenderPath call comes.

    Theoretically, underlines could also be created differently, e.g. using a bitmap image, but I'm not aware of PDF producers doing so.

    By the way, in your example PDF the underlines appear to be drawn consistently using a MoveTo to the line's starting point, a LineTo to its end, and then a Stroke to simply stroke the path. Thus, for each underline you'll get two ModifyPath calls (one with operation value MOVETO, one with LINETO) and one RenderPath call (with operation STROKE).
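    For that MoveTo / LineTo / Stroke pattern, the bookkeeping might be sketched as below. To keep the sketch self-contained it does not reference iTextSharp: StrokedLineCollector and its method names are hypothetical, and the numeric constants are assumed to mirror PathConstructionRenderInfo.MOVETO / LINETO and PathPaintingRenderInfo.STROKE, so prefer the library constants in real code. An IExtRenderListener implementation would forward the operation and segment data of each callback's info object to it (as far as I recall, iTextSharp 5.5.x exposes these as Operation and SegmentData; check your version). The coordinates are stored as received, so a real implementation would additionally have to apply the current transformation matrix.

```csharp
using System;
using System.Collections.Generic;

public class StrokedLineCollector
{
    // Assumed values of the corresponding library constants (see lead-in).
    public const int MoveTo = 1, LineTo = 2, Stroke = 1;

    private readonly List<float[]> _pending = new List<float[]>();
    private float _x, _y;
    private bool _hasPoint;

    // Each entry is { x1, y1, x2, y2 } for one stroked straight line.
    public List<float[]> StrokedLines { get; } = new List<float[]>();

    // Forward the operation and segment data of each ModifyPath call here.
    public void OnModifyPath(int operation, IList<float> data)
    {
        if (operation == MoveTo)
        {
            _x = data[0]; _y = data[1];
            _hasPoint = true;
        }
        else if (operation == LineTo && _hasPoint)
        {
            _pending.Add(new[] { _x, _y, data[0], data[1] });
            _x = data[0]; _y = data[1];
        }
        else
        {
            // Curves, rectangles and close operations are not tracked in this
            // sketch; forget the current point so no bogus line is recorded.
            _hasPoint = false;
        }
    }

    // Forward the operation of each RenderPath call here.
    public void OnRenderPath(int operation)
    {
        // The operation is treated as a bit mask, so "fill and stroke" also counts.
        if ((operation & Stroke) != 0)
            StrokedLines.AddRange(_pending);
        _pending.Clear();
        _hasPoint = false;
    }
}
```

    After the page has been processed, StrokedLines holds the candidate underlines, which can then be compared with the text chunk positions collected in RenderText.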