I need to convert a few large documents to a database, and I have the files in xml, xhtml, epub and pdf.
Assuming the files themselves are completely faultless, which of these formats will enable me to extract the text with the least mistakes and missing elements?
I am guessing that pdf will likely be the worst performer (I remember seeing a table of extraction accuracies where the best library scored around 98% and most were below that), but I included it in the list in case I am mistaken.
Many thanks in advance!
The problem with PDF is that, at worst, it's just a bunch of individual characters placed at particular co-ordinates on a page. (I.e., words and lines of text are all in the eye of the beholder.) Now, the particular PDF files you have might be better behaved than that, but I don't know. In any case, PDF files are complex data structures, so parsing them is complex, and extracting the text is not straightforward.
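If you do end up having to pull text out of the PDFs anyway, a library such as pdfminer.six gives you a one-call starting point. This is only a sketch (the file name is a placeholder), and how clean the output is depends entirely on how the PDF was produced:

```python
# Sketch: extracting text from a PDF with pdfminer.six (pip install pdfminer.six).
# "document.pdf" is a placeholder; output quality depends on how the PDF was made.
from pdfminer.high_level import extract_text

text = extract_text("document.pdf")
print(text[:500])  # inspect the first few hundred characters to gauge quality
```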
Now, technically, an xml or xhtml file could be just as hairy as a PDF. (E.g., you could have an xml file that is just a list of elements like <letter loc="234,1743">A</letter>.) But in practice, they aren't. If you can look at an xml/xhtml file and see the text you're interested in, then it will probably be easy to extract it programmatically.
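For a well-behaved xml/xhtml file, the standard library is often enough. A rough sketch, assuming you just want all the character data and that the file parses as XML (the file name is made up; if it uses HTML entities you may need lxml or an HTML parser instead):

```python
# Sketch: pulling all text out of a well-formed XML/XHTML file with the standard library.
# "document.xhtml" is a placeholder; adjust if you only want particular elements.
import xml.etree.ElementTree as ET

tree = ET.parse("document.xhtml")
# itertext() walks the tree and yields every piece of character data in document order.
text = "".join(tree.getroot().itertext())
print(text[:500])
```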
Epub would be comparable to xml/xhtml in terms of losslessness, but might be a bit more complicated to deal with.
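The extra complication is that an epub is essentially a zip archive of xhtml files plus a manifest describing their reading order. A minimal sketch that ignores the manifest and just walks whatever (x)html files it finds ("book.epub" is a placeholder, and real files may need more careful parsing):

```python
# Sketch: an epub is a zip of xhtml files; this ignores the manifest (content.opf)
# and reading order, and simply extracts text from every (x)html file it finds.
import zipfile
import xml.etree.ElementTree as ET

with zipfile.ZipFile("book.epub") as epub:
    for name in epub.namelist():
        if name.endswith((".xhtml", ".html", ".htm")):
            root = ET.fromstring(epub.read(name))
            print("".join(root.itertext())[:200])  # peek at each chapter's text
```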
It would probably be a good idea to find out how the documents were authored, and how the various formats were derived. (I.e., if the assumption that the files are faultless is incorrect, that might have a bigger effect on the choice of format to use.)