Search code examples
pythonpdfpython-camelot

Camelot: table_area and table_regions do not work as expected


I have been trying to make Camelot work on specific areas of pdf pages for a good couple of days but it keeps puzzling me. I reviewed and tried the docs suggestions, a few bug reports and this SO question to no avail. I could use some help.

I took an example from the docs, since it has more than one table, this one. I amended the original command to extract only one of the two tables, from:

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')

to:

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')

Whereas:

  • I changed the regex because it was eliminating spaces between words,
  • used table_area instead of the docs' table_areas because the former triggers the elaboration, while the second an error (the bug is explained here, and the docs still seem to be wrong)
  • tried to extract both tables and checked the respective areas using camelot's plot feature as explained in the docs here, so they should be right,
  • tried also using table_regions and at least it pulls one table out instead of two, but it remains rather inaccurate (see comments below)

So here are the results of my trials on the pdf mentioned above:

First one: using table_area on the '35,591,385,343' PDF area (top table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,...

Notice how the tables are two, and it includes unwanted text both at the top and bottom, which should not be inside the area chosen using plot().

Second: using table_regions on the same '35,591,385,343' PDF area, top table

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,... 

Just one table, same issue with unwanted text outside the selected area, apparently.

Third: Using table_area on the '33,297,386,65' PDF area (bottom table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,...

It picks up both tables and clearly the first one remains the top one. Same issue with unwanted text, but it is now expected.

Fourth: Using table_regions on the '33,297,386,65' PDF area (bottom table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
                                                    0           1          2          3               4              5
0                    Table 325. Arrests by Race: 2009                                                                 
1   [Based on Uniform Crime Reporting (UCR) Progra...                                                                 
2   with a total population of 239,839,971 as esti...                                                                 
3                                                                                              American               
4                                     Offense charged                                    Indian/Alaskan  Asian Pacific
5                                                           Total      White      Black          Native       Islander
6   Total  . . . . .  . .  .  . .  .  . . .  .  . ...  10,690,561  7,389,208  3,027,153         150,544        123,656
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...     456,965    268,346    177,766           5,608          5,245
8     Murder and nonnegligent manslaughter . .. ... .       9,739      4,741      4,801             100             97
9   Forcible rape . . . . . . . .. .. .. .. .... ....      16,362     10,644      5,319             169            230
10  Robbery . . . . .. . . . ... . ... . .... .......     100,496     43,039     55,742             726            989
11  Aggravated assault  . . . . . . . .. .. .........     330,368    209,922    111,904           4,613          3,929
....
34  All other offenses (except traffic) . .. .. .....   2,929,217  1,937,221    911,670          43,880         36,446
35  Suspicion . . .. . . . .. .. .. .. .. .. .. .....       1,513        677        828               1              7
36  Curfew and loitering law violations  . .. ... ...      89,578     54,439     33,207             872          1,060
37  Runaways  . . . . . . . .. .. .. .. .. .. .......      73,616     48,343     19,670           1,653          3,950
38           1 Except forcible rape and prostitution.

Better, yet it picks up unwanted text as above.

I would really value suggestions or pointers. Thanks in advance!


Solution

  • table_areas (not table_area) keyword argument works well and should be used (I use Camelot 0.7.3).

    tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_areas=['35,591,385,343'], pages = '1')
    

    returns:

    enter image description here

    which seems to be right.