The PDF at this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables.
I'd like to programmatically extract the data and the structure from these tables.
Things I've tried: converting the PDF to HTML using
I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.
The output could be JSON (e.g. one object per table), XML, or pretty much any format that maintains the structure.
You could try PDFBox. The documentation for that is here:
https://pdfbox.apache.org/1.8/cookbook/textextraction.html
Extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use their coordinates to determine the column and row positions. You can then set up text regions to determine which characters are drawn in which region. Since you know the layout of the regions is tabular, you can define the tables and work out which row and column each piece of extracted text belongs to by comparing its position against the row and column boundaries.
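As a rough illustration of that last step, here is a minimal Python sketch of assigning text to cells once the line segments and text positions have been collected. It does not touch PDFBox at all. It assumes you have already dumped the segments as (x0, y0, x1, y1) tuples and the text as (x, y, string) tuples, with y increasing downward (native PDF coordinates start at the bottom-left, so you may need to flip them first). The names build_table and cell_index and the sample data are made up for the example.

```python
import json
from bisect import bisect_right


def cell_index(boundaries, coord):
    """Index of the interval [b[i], b[i+1]) that contains coord, or None."""
    i = bisect_right(boundaries, coord) - 1
    if 0 <= i < len(boundaries) - 1:
        return i
    return None


def build_table(segments, texts, tol=0.5):
    """Group text into rows and columns using ruling lines.

    segments: iterable of (x0, y0, x1, y1) line segments (assumed input format)
    texts:    iterable of (x, y, string) text positions (assumed input format)
    """
    # Vertical segments (x barely changes) mark column boundaries.
    xs = sorted({round(x0, 1) for x0, y0, x1, y1 in segments
                 if abs(x0 - x1) <= tol})
    # Horizontal segments (y barely changes) mark row boundaries.
    ys = sorted({round(y0, 1) for x0, y0, x1, y1 in segments
                 if abs(y0 - y1) <= tol})

    rows = [[""] * (len(xs) - 1) for _ in range(len(ys) - 1)]
    for x, y, s in texts:
        r, c = cell_index(ys, y), cell_index(xs, x)
        if r is None or c is None:
            continue  # text lies outside the grid
        # Several text fragments can land in one cell; join them with spaces.
        rows[r][c] = (rows[r][c] + " " + s).strip()
    return rows


# Made-up sample: a 2x2 grid, 200 wide and 40 tall.
segments = [
    (0, 0, 0, 40), (100, 0, 100, 40), (200, 0, 200, 40),   # vertical lines
    (0, 0, 200, 0), (0, 20, 200, 20), (0, 40, 200, 40),    # horizontal lines
]
texts = [(10, 12, "Name"), (110, 12, "Value"), (10, 32, "foo"), (110, 32, "42")]

print(json.dumps({"rows": build_table(segments, texts)}))
# {"rows": [["Name", "Value"], ["foo", "42"]]}
```

This only handles a single, fully ruled grid. Merged cells, tables without ruling lines, and several tables on one page would need extra logic, for example clustering the segments into separate grids first.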