ruby-on-rails ruby pdf docsplit

docsplit gem pdf to text


I have the same problem as the one discussed here: http://blog.joshsoftware.com/2014/08/13/pdf-to-plain-text-processing-using-docsplit/ However, the solution they propose for docsplit doesn't work for me.

 Docsplit.extract_text(filepath, {:pdf_opts => '-layout', output: 'tmp_text_file'})

The :pdf_opts => '-layout' option doesn't seem to do anything, and I can't find any documentation about options like it. As a result, I get a single word per line in the output text file.

Does anyone know how to get a text file that preserves the layout of the PDF?

Thank you


Solution

  • If you read the blog post carefully, you will see that passing

     :pdf_opts => '-layout'
    

    is not yet supported by the master branch of the docsplit gem. Support for it comes from this pull request: https://github.com/documentcloud/docsplit/pull/114. So, in your Gemfile, use

    gem 'docsplit', git: 'https://github.com/narutosanjiv/docsplit.git'
    

    Hope this helps. Let me know if you still face any issues.
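
If pointing the Gemfile at a fork is not an option, one possible workaround is to call pdftotext (the Poppler tool that docsplit relies on for PDF text extraction) directly with its -layout flag. The sketch below is only an illustration under that assumption: pdftotext_command and extract_text_with_layout are hypothetical helper names, not part of docsplit, and pdftotext must be installed and on the PATH.

```ruby
require 'open3'

# Hypothetical helper (not part of docsplit): builds the argument list
# for Poppler's pdftotext command-line tool.
def pdftotext_command(pdf_path, txt_path, layout: true)
  cmd = ['pdftotext']
  cmd << '-layout' if layout # ask pdftotext to keep the physical layout
  cmd + [pdf_path, txt_path]
end

# Hypothetical helper: runs pdftotext and returns the extracted text.
# Assumes the pdftotext binary is available on the PATH.
def extract_text_with_layout(pdf_path, txt_path)
  _out, err, status = Open3.capture3(*pdftotext_command(pdf_path, txt_path))
  raise "pdftotext failed: #{err}" unless status.success?
  File.read(txt_path)
end
```

Calling extract_text_with_layout(filepath, 'tmp_text_file') would then write the text file and return its contents. The command is passed as an argument array rather than a single string, so file paths containing spaces are not split by a shell.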