I am trying to read the PDF file below, and I need to save each article in a separate file.
https://dl.dropboxusercontent.com/u/23092311/sample.pdf
An article can span one or more pages. I have used PDFMiner to convert the entire PDF to a text file, but I don't know how to split that text into individual articles.
I am new to Python. Could you suggest a good method, or provide sample code, to extract each article separately?
I'll be honest: I've never used PDFMiner before. But if you already have the PDF in a text file, couldn't you just read the text file into a string and then use the split function to divide the string into articles based on the "The New York Times" heading? That assumes PDFMiner is capable of reading that fancy font, and I don't know whether it is.
Looking at the file you provided, you could do something like the following:
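# Read the whole PDFMiner text output into one string; the file is closed automatically.
with open('test.txt') as reading:
    full_paper = reading.read()
# Split on the copyright line that appears in each article.
split_paper = full_paper.split('Copyright 2014 The New York Times Company. All Rights Reserved.')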
split_paper would then be a list containing your articles at indexes 1, 2, 3, 4, 5, 6 (index 0 would contain the initial heading). You'd have to do some other string cleanup to get the exact articles, but that should at least get you started.
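To get from that list to one file per article, something along these lines might work. It is only a rough sketch: the function name save_articles, the output folder "articles" and the article_N.txt file names are all made up for illustration, it assumes Python 3, and it assumes, as above, that everything before the first copyright line is just the heading.

```python
import os

DELIMITER = 'Copyright 2014 The New York Times Company. All Rights Reserved.'


def save_articles(text, out_dir='articles', delimiter=DELIMITER):
    """Split text on delimiter and write each non-empty piece to its own file.

    save_articles, out_dir and the article_N.txt names are hypothetical,
    chosen only for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    pieces = text.split(delimiter)
    paths = []
    # pieces[0] is the text before the first delimiter (assumed to be the
    # initial heading), so start from the second piece.
    for number, piece in enumerate(pieces[1:], start=1):
        body = piece.strip()
        if not body:
            # Skip empty pieces, e.g. when the text ends with the delimiter.
            continue
        path = os.path.join(out_dir, 'article_%d.txt' % number)
        with open(path, 'w') as out:
            out.write(body)
        paths.append(path)
    return paths


# Example use, assuming the PDFMiner output is in test.txt:
# with open('test.txt') as reading:
#     save_articles(reading.read())
```

The function returns the list of paths it wrote, so you can check how many articles it found before doing any further cleanup.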
Make sense?