I'm trying to extract the text of PDF files, without the watermark text, using the Apache PDFBox library. My plan is to remove the watermark first, so that whatever remains is the text I want. Unfortunately, neither PDMetadata nor PDXObject lets me recognize the watermark. Any help will be appreciated. I found the code below.

    // Note: this snippet uses the PDFBox 1.8.x API (getAllPages, PDXObjectImage).
    // convertStreamToString and getUniqueFileName are helper methods not shown here.

    // Open PDF document
    PDDocument document = null;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Get all pages and loop through them
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map images = null;
        // Get all images on the page
        try {
            images = resources.getImages(); // How to specify watermark instead of images??
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (images != null) {
            // Check all images for metadata
            Iterator imageIter = images.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = (String) imageIter.next();
                PDXObjectImage image = (PDXObjectImage) images.get(key);
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: Analyzing for Metadata");
                if (metadata == null) {
                    System.out.println("No Metadata found for this image.");
                } else {
                    InputStream xmlInputStream = null;
                    try {
                        xmlInputStream = metadata.createInputStream();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    try {
                        System.out.println("--------------------------------------------------------------------------------");
                        String mystring = convertStreamToString(xmlInputStream);
                        System.out.println(mystring);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the images
                String name = getUniqueFileName(key, image.getSuffix());
                System.out.println("Writing image:" + name);
                try {
                    image.write2file(name);
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    //e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    }

Contrary to your assumption, a PDF contains nothing like an explicit watermark object by which watermarks could be recognized in generic PDFs.
Watermarks can be applied to a PDF page in many ways; each PDF-creating library or application has its own way of adding watermarks, and some even offer multiple ways.
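Watermarks can be drawn as part of the regular page content, either underneath the rest of the content (as a background) or above it (as an overlay), or they can be placed in a dedicated watermark annotation.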
Sometimes even mixed forms are used. Have a look at this answer for an example: at the bottom you find a 'watermark' drawn above graphics but beneath text (to allow for easy reading).
The latter choice (the watermark annotation) obviously is easy to remove, but it is also the least often used one, most likely precisely because it is so easy to remove; people applying watermarks generally don't want their watermarks to get lost. Furthermore, annotations are sometimes handled incorrectly by PDF viewers, and code that copies page content often ignores annotations.
If, on the other hand, you do not handle generic documents but one specific type of document (all generated alike), the particular manner in which the watermarks are applied in them can probably be recognized, and an extraction routine might be feasible. If you have such a use case, please share a sample PDF for inspection.
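
To illustrate what recognizing the manner of application can look like: some producers (Adobe Acrobat among them, as far as I know) wrap the watermarks they add in a marked-content section of the form `/Artifact <<... /Subtype /Watermark ...>> BDC ... EMC`. The following is only a minimal sketch under that assumption, not a general solution. It also assumes the content stream has already been split into tokens by a real parser and that the property dictionary arrives as a single string token, which is a simplification. The class and method names are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WatermarkArtifactFilter {

    /**
     * Returns a copy of the token list without the marked-content sections
     * that start with: /Artifact <<... /Watermark ...>> BDC
     * Assumption: tokens come from a real content stream parser, and the
     * property dictionary is represented as one string token.
     */
    public static List<String> stripWatermarkArtifacts(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        int depth = 0; // greater than 0 while inside a watermark section
        for (String token : tokens) {
            if (depth > 0) {
                // Track nested marked content so the matching EMC is found
                if (token.equals("BMC") || token.equals("BDC")) {
                    depth++;
                } else if (token.equals("EMC")) {
                    depth--;
                }
                continue; // drop everything inside, including the closing EMC
            }
            if (token.equals("BDC") && startsWatermark(kept)) {
                kept.remove(kept.size() - 1); // the property dictionary
                kept.remove(kept.size() - 1); // the /Artifact tag
                depth = 1;
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    // True if the two most recent tokens are the operands of a watermark BDC
    private static boolean startsWatermark(List<String> kept) {
        int n = kept.size();
        return n >= 2
                && kept.get(n - 2).equals("/Artifact")
                && kept.get(n - 1).contains("/Watermark");
    }

    public static void main(String[] args) {
        List<String> content = Arrays.asList(
                "/Artifact", "<</Subtype /Watermark /Type /Pagination>>", "BDC",
                "q", "/Fm0", "Do", "Q",
                "EMC",
                "BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        // prints [BT, /F1, 12, Tf, (Hello), Tj, ET]
        System.out.println(stripWatermarkArtifacts(content));
    }
}
```

If the section is never closed by an `EMC`, this sketch drops everything up to the end of the stream, so a real routine should check for balanced sections first. Watermarks applied in any other manner pass through untouched, which is exactly why a sample PDF is needed before writing such a routine.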