python-3.x parsing pdf python-camelot

How to parse a table in a PDF written in a non-English language


I used Camelot and tabula to parse a PDF file that contains Cyrillic text. In the output CSV file, however, the characters came out garbled, with no recognizable Russian text.

What can help me parse a PDF table in a non-English language?

import camelot
file = 'file-name.pdf'
tables = camelot.read_pdf(file, pages="1-end", encoding='utf-8')

Output: 00550529-1295-06 -ТКР5.СО1 0520529-12955--0066--ТТККРР55--ГГЧЧ23 00552299--11229955--0066--ТТККРР55--ГГЧЧ45


Solution

  • Camelot handles Cyrillic text well. For example:

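    # Install first, from the shell: pip install "camelot-py[cv]"
    import camelot

    file = 'file-name.pdf'
    # Read only the pages that contain the tables of interest
    tables = camelot.read_pdf(file, pages="4,5", encoding='utf-8')
    # .df exposes the first extracted table as a pandas DataFrame
    df_p4 = tables[0].df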
    

    The output is fairly raw and needs cleaning, but the Cyrillic characters come through intact, which I consider a good result.
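
The cleaning step mentioned above could look roughly like the sketch below. It is not part of the original answer: the helper names `clean_rows` and `write_csv`, the output file name, and the sample rows are made up for illustration, and the sample stands in for something like `df_p4.values.tolist()`. It uses only the standard library, collapses stray whitespace and line breaks inside cells, drops empty rows, and writes the CSV as `utf-8-sig` so that spreadsheet programs such as Excel are more likely to detect the encoding and show Cyrillic correctly.

```python
import csv


def clean_rows(rows):
    """Collapse whitespace in every cell and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        # str.split() with no arguments also removes embedded newlines and tabs
        cells = [" ".join(str(cell).split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def write_csv(rows, path):
    """Write rows to a CSV file encoded as UTF-8 with a byte order mark."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)


# Made-up sample standing in for the raw table, e.g. df_p4.values.tolist()
raw = [
    ["  Обозначение ", "Наименование\nдокумента"],
    ["", "  "],
    ["0529-1295-06-ТКР5", " Пояснительная  записка "],
]

rows = clean_rows(raw)
write_csv(rows, "page4.csv")
```

If the garbling only shows up when the CSV is opened in a spreadsheet program, the encoding of the file on disk is worth checking before blaming the parser.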