Extract text from a PDF in Railo

I've just taken over coding a Railo site and I want to index a large number of PDFs. However, cfindex only seems to index text docs. I see there is <cfpdf action="extracttext">, but apparently this is not supported in Railo. Can anyone confirm or otherwise? If not, is the best option org.apache.pdfbox?


  • PDFBox will certainly do the job. There's an old version included in the Railo class path, but I found it to be buggy. Instead I would use JavaLoader to load the latest version.


    /* The latest pre-built standalone PDFBox jar file and the javaloader package are assumed to be in the same folder as the following component */
    component{

        function init( javaLoaderPath="javaloader.JavaLoader" ){
            // Cache the JavaLoader instance in the server scope so the jar is only loaded once
            if( !server.KeyExists( "_pdfBoxLoader" ) ){
                var paths=[];
                paths.append( GetDirectoryFromPath( GetCurrentTemplatePath() ) & "pdfbox-app-1.8.11.jar" );
                server._pdfBoxLoader=New "#javaLoaderPath#"( paths );
            }
            variables.reader=server._pdfBoxLoader.create( "org.apache.pdfbox.pdmodel.PDDocument" );
            variables.stripper=server._pdfBoxLoader.create( "org.apache.pdfbox.util.PDFTextStripper" );
            return this;
        }

        string function extractText( required string pdfPath, numeric startPage=0, numeric endPage=0 ){
            // Only override the stripper's page range when a non-zero value is passed
            if( Val( startPage ) )
                stripper.setStartPage( startPage );
            if( Val( endPage ) )
                stripper.setEndPage( endPage );
            var pdf=reader.load( pdfPath );
            try{
                return stripper.getText( pdf );
            }
            finally{
                // Always close the document, even if extraction fails
                pdf.close();
            }
        }

    }

    See for more detail.

    The above will also work with Lucee, Railo's successor, to which I'd strongly advise migrating.
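
    For completeness, here is a rough sketch of how the component might be called. It assumes the component is saved as PDFTextExtractor.cfc in the same folder as the calling template; that file name and the PDF path are made up for illustration and are not part of the original code.

    ```cfml
    // Hypothetical component name: assumes the code above is saved as PDFTextExtractor.cfc
    extractor = New PDFTextExtractor();

    // Hypothetical file; a full file system path is the safest thing to pass in
    pdfPath = ExpandPath( "docs/manual.pdf" );

    // Extract the text of the whole document
    allText = extractor.extractText( pdfPath );

    // Extract pages 2 to 5 only (PDFBox page numbers are 1-based)
    someText = extractor.extractText( pdfPath=pdfPath, startPage=2, endPage=5 );
    ```

    The returned string should then be usable like any other text you want to index.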