Reading Tables as string from PDF with Tabula


I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read PDF tables into a DataFrame with tabula.read_pdf:

from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])

The problem is that the values are read as float instead of string.

I need the values to be read as strings so that a value such as 20.0000 keeps its trailing zeros, which tell me it is accurate to the fourth decimal place. Currently it returns 20.0 instead of 20.0000.

The input data in the PDF looks like this: [image]

The output of the code above is: [image]


Solution

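  • You need to pass a couple of extra options to tabula.read_pdf, in particular pandas_options with dtype set to str. The example below parses the same PDF file twice: first with every column read as a string, then with the column types guessed: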

    import tabula
    
    print(tabula.environment_info())
    
    fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
             "data.pdf")
    
    # Columns interpreted as str
    col2str = {'dtype': str}
    kwargs = {'output_format': 'dataframe',
              'pandas_options': col2str,
              'stream': True}
    df1 = tabula.read_pdf(fname, **kwargs)
    
    print(df1[0].dtypes)
    print(df1[0].head())
    
    # Guessing column type
    col2val = {'dtype': None}
    kwargs = {'output_format': 'dataframe',
              'pandas_options': col2val,
              'stream': True}
    df2 = tabula.read_pdf(fname, **kwargs)
    
    print(df2[0].dtypes)
    print(df2[0].head())
    

    With the following output:

    Python version:
        3.7.6 (default, Jan  8 2020, 13:42:34) 
    [Clang 4.0.1 (tags/RELEASE_401/final)]
    Java version:
        openjdk version "13.0.2" 2020-01-14
    OpenJDK Runtime Environment (build 13.0.2+8)
    OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
    tabula-py version: 2.0.4
    platform: Darwin-19.3.0-x86_64-i386-64bit
    uname:
        uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan  9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
    linux_distribution: ('Darwin', '19.3.0', '')
    mac_ver: ('10.15.3', ('', '', ''), 'x86_64')
    
    None
    'pages' argument isn't specified.Will extract only from page 1 by default.
    Unnamed: 0    object
    mpg           object
    cyl           object
    disp          object
    hp            object
    drat          object
    wt            object
    qsec          object
    vs            object
    am            object
    gear          object
    carb          object
    dtype: object
              Unnamed: 0   mpg cyl   disp   hp  drat     wt   qsec vs am gear carb
    0          Mazda RX4  21.0   6  160.0  110  3.90  2.620  16.46  0  1    4    4
    1      Mazda RX4 Wag  21.0   6  160.0  110  3.90  2.875  17.02  0  1    4    4
    2         Datsun 710  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4    1
    3     Hornet 4 Drive  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3    1
    4  Hornet Sportabout  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3    2
    'pages' argument isn't specified.Will extract only from page 1 by default.
    Unnamed: 0     object
    mpg           float64
    cyl             int64
    disp          float64
    hp              int64
    drat          float64
    wt            float64
    qsec          float64
    vs              int64
    am              int64
    gear            int64
    carb            int64
    dtype: object
              Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
    0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
    1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
    2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
    3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
    4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
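
To see the effect of dtype in isolation, without a PDF or Java, here is a minimal sketch using pandas alone. It assumes that the dtype entry in pandas_options behaves like the dtype argument of pandas.read_csv; how tabula-py applies the options internally may differ. The inline CSV text and the column name "value" are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for an extracted table: one column with trailing zeros.
csv_text = "value\n20.0000\n3.1400\n"

# dtype=str keeps each cell exactly as it was written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Default type inference parses the cells as floats, so trailing zeros are lost.
guessed = pd.read_csv(io.StringIO(csv_text))

print(as_text["value"].tolist())
print(guessed["value"].tolist())
```

The first list holds the strings '20.0000' and '3.1400', while the second holds the floats 20.0 and 3.14. For the call in the question, the same option would presumably be passed as read_pdf(fn, pages='all', multiple_tables=True, pandas_options={'dtype': str}).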