Tags: java, pdf, text, hyperlink, pdfbox

How to extract hyperlink information with PDFBox


I am trying to extract the hyperlink information from a PDF using PDFBox, but I am unsure how to get it:

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());

        }
    }

}

I want to extract the URL of the hyperlink destination and the text of the hyperlink. How can I do this?

Thanks


Solution

  • Use this code, taken from the PrintURLs example in the PDFBox source code download:

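    // doc is an already loaded PDDocument; pageNum is only used in the printed output
    int pageNum = 0;
    for( PDPage page : doc.getPages() )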
    {
        pageNum++;
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        List<PDAnnotation> annotations = page.getAnnotations();
        //first setup text extraction regions
        for( int j=0; j<annotations.size(); j++ )
        {
            PDAnnotation annot = annotations.get(j);
            if( annot instanceof PDAnnotationLink )
            {
                PDAnnotationLink link = (PDAnnotationLink)annot;
                PDRectangle rect = link.getRectangle();
                //need to reposition link rectangle to match text space
                float x = rect.getLowerLeftX();
                float y = rect.getUpperRightY();
                float width = rect.getWidth();
                float height = rect.getHeight();
                int rotation = page.getRotation();
                if( rotation == 0 )
                {
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                }
                else if( rotation == 90 )
                {
                    //do nothing
                }
    
                Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
                stripper.addRegion( "" + j, awtRect );
            }
        }
    
        stripper.extractRegions( page );
    
        for( int j=0; j<annotations.size(); j++ )
        {
            PDAnnotation annot = annotations.get(j);
            if( annot instanceof PDAnnotationLink )
            {
                PDAnnotationLink link = (PDAnnotationLink)annot;
                PDAction action = link.getAction();
                String urlText = stripper.getTextForRegion( "" + j );
                if( action instanceof PDActionURI )
                {
                    PDActionURI uri = (PDActionURI)action;
                    System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
                }
            }
        }
    }
    

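    It works in two parts. Getting the URL is the easy part: it comes from the link annotation's action when that action is a PDActionURI. Getting the link text takes more work: the annotation's rectangle is registered as a region with PDFTextStripperByArea, and the text extracted from that region is the visible text of the link.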
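
The "reposition link rectangle" step in the code above can be tried in isolation. The annotation rectangle is measured from the bottom-left corner of the page with y growing upward, whereas the rectangle handed to addRegion is treated as measured from the top-left corner with y growing downward, which is why the sample computes pageHeight - upperRightY. Below is a minimal sketch of that conversion for the rotation == 0 case, using only java.awt.geom. The class name LinkRegionSketch and the method toTextSpace are hypothetical helpers, not part of PDFBox, and the page height (792) and link coordinates are made-up values.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: mirrors the y-flip done in the sample.
final class LinkRegionSketch {

    // Converts a rectangle given in page coordinates (origin bottom-left,
    // y grows upward) into one whose origin is the top-left corner of the
    // page and whose y grows downward. Only the unrotated case is handled.
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float lowerLeftY,
                                         float upperRightX, float upperRightY,
                                         float pageHeight) {
        float width = upperRightX - lowerLeftX;
        float height = upperRightY - lowerLeftY;
        // The top edge of the link, measured down from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Assumed values: a link spanning (72, 700) to (200, 712) on a page 792 units high.
        Rectangle2D.Float r = toTextSpace(72f, 700f, 200f, 712f, 792f);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
        // prints "72.0 80.0 128.0 12.0"
    }
}
```

A link whose top edge sits 712 units above the bottom of a 792-unit page ends up 80 units below the top, so the region covers the same area of the page that the annotation does. Rotated pages need different handling, which the sample leaves out.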