I have been trying to make Camelot work on specific areas of PDF pages for a good couple of days, but it keeps puzzling me. I reviewed and tried the docs' suggestions, a few bug reports and this SO question, to no avail. I could use some help.
I took an example PDF from the docs (this one), since it has more than one table. I amended the original command to extract only one of the two tables, from:
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')
to:
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
Where:
table_area is used instead of the docs' table_areas, because the former runs the extraction while the latter raises an error (the bug is explained here, and the docs still seem to be wrong).
table_regions at least pulls one table out instead of two, but it remains rather inaccurate (see comments below).
So here are the results of my trials on the PDF mentioned above:
First: using table_area on the '35,591,385,343' PDF area (top table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
Notice that two tables are returned, and that the first one includes unwanted text at both the top and the bottom, which should not be inside the area I chose using plot().
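For reference, this is roughly how the coordinates can be sanity-checked before extraction. It is a minimal sketch, assuming the documented 'x1,y1,x2,y2' convention (left-top corner first, then right-bottom, in PDF coordinate space with the origin at the bottom-left) and the camelot.plot(..., kind='text') call shown in the docs. The helper names check_area and show_text_layout are hypothetical and not part of Camelot.

```python
def check_area(area):
    """Parse an 'x1,y1,x2,y2' string and confirm it runs from left-top to right-bottom.

    PDF coordinate space has its origin at the bottom-left of the page,
    so y1 (top edge) must be larger than y2 (bottom edge).
    """
    x1, y1, x2, y2 = (float(v) for v in area.split(","))
    if not (x1 < x2 and y1 > y2):
        raise ValueError(f"area {area!r} is not left-top to right-bottom")
    return x1, y1, x2, y2


def show_text_layout(pdf_path, page="1"):
    """Plot the text boxes found on a page so area coordinates can be read off the axes."""
    import camelot  # imported lazily: needs camelot-py and matplotlib installed

    tables = camelot.read_pdf(pdf_path, flavor="stream", pages=page)
    camelot.plot(tables[0], kind="text").show()


if __name__ == "__main__":
    print(check_area("35,591,385,343"))  # top table area used in this question
    print(check_area("33,297,386,65"))   # bottom table area used in this question
```

Both areas used here pass the orientation check, so the coordinates themselves do not look like the source of the unwanted text.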
Second: using table_regions on the same '35,591,385,343' PDF area (top table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
Just one table, same issue with unwanted text outside the selected area, apparently.
Third: using table_area on the '33,297,386,65' PDF area (bottom table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
It picks up both tables and clearly the first one remains the top one. Same issue with unwanted text, but it is now expected.
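The first and third trials give identical output even though the areas differ, which would be consistent with the singular table_area keyword being ignored rather than applied. That is only a guess: I am assuming read_pdf passes unrecognised keyword arguments through without complaint, and I have not verified this against the Camelot source. A small guard like the sketch below would at least flag such a name before the call. The set of stream keywords is my reading of the 0.7.x docs and may differ in other versions, and find_unknown_kwargs is a hypothetical helper, not a Camelot function.

```python
import difflib

# Keyword arguments documented for flavor='stream' in Camelot 0.7.x.
# Assumption: this list is copied from the docs and may not match other versions.
STREAM_KWARGS = {
    "table_regions", "table_areas", "columns", "split_text", "flag_size",
    "strip_text", "edge_tol", "row_tol", "column_tol",
}


def find_unknown_kwargs(kwargs, known=STREAM_KWARGS):
    """Return {unknown_name: closest_known_name_or_None} for the given kwargs."""
    unknown = {}
    for name in kwargs:
        if name not in known:
            match = difflib.get_close_matches(name, sorted(known), n=1)
            unknown[name] = match[0] if match else None
    return unknown


if __name__ == "__main__":
    call_kwargs = {"strip_text": "\n", "table_area": ["33,297,386,65"]}
    for name, suggestion in find_unknown_kwargs(call_kwargs).items():
        print(f"unknown keyword {name!r}; did you mean {suggestion!r}?")
```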
Fourth: using table_regions on the '33,297,386,65' PDF area (bottom table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
0 1 2 3 4 5
0 Table 325. Arrests by Race: 2009
1 [Based on Uniform Crime Reporting (UCR) Progra...
2 with a total population of 239,839,971 as esti...
3 American
4 Offense charged Indian/Alaskan Asian Pacific
5 Total White Black Native Islander
6 Total . . . . . . . . . . . . . . . . ... 10,690,561 7,389,208 3,027,153 150,544 123,656
7 Violent crime . . . . . . . . . . . ... 456,965 268,346 177,766 5,608 5,245
8 Murder and nonnegligent manslaughter . .. ... . 9,739 4,741 4,801 100 97
9 Forcible rape . . . . . . . .. .. .. .. .... .... 16,362 10,644 5,319 169 230
10 Robbery . . . . .. . . . ... . ... . .... ....... 100,496 43,039 55,742 726 989
11 Aggravated assault . . . . . . . .. .. ......... 330,368 209,922 111,904 4,613 3,929
....
34 All other offenses (except traffic) . .. .. ..... 2,929,217 1,937,221 911,670 43,880 36,446
35 Suspicion . . .. . . . .. .. .. .. .. .. .. ..... 1,513 677 828 1 7
36 Curfew and loitering law violations . .. ... ... 89,578 54,439 33,207 872 1,060
37 Runaways . . . . . . . .. .. .. .. .. .. ....... 73,616 48,343 19,670 1,653 3,950
38 1 Except forcible rape and prostitution.
Better, yet it picks up unwanted text as above.
I would really value suggestions or pointers. Thanks in advance!
The table_areas (not table_area) keyword argument works well and should be used (I use Camelot 0.7.3).
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_areas=['35,591,385,343'], pages = '1')
returns a result which seems to be right.
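As a minimal sketch, the same call can be wrapped so that both areas from the question are extracted in one pass. It assumes Camelot 0.7.3 as above and that table_areas accepts a list of 'x1,y1,x2,y2' strings, as the docs describe. AREAS, stream_kwargs and extract_tables are hypothetical names introduced for illustration.

```python
# Areas taken from the question: left-top to right-bottom in PDF coordinates.
AREAS = {
    "top": "35,591,385,343",
    "bottom": "33,297,386,65",
}


def stream_kwargs(areas, page="1"):
    """Build the keyword arguments for a stream extraction limited to the given areas."""
    return {
        "flavor": "stream",
        "strip_text": "\n",
        "table_areas": list(areas),  # plural keyword, one string per area
        "pages": page,
    }


def extract_tables(pdf_path, areas, page="1"):
    """Return one pandas DataFrame per table Camelot finds in the given areas."""
    import camelot  # imported lazily: requires camelot-py to be installed

    tables = camelot.read_pdf(pdf_path, **stream_kwargs(areas, page))
    return [table.df for table in tables]
```

Calling extract_tables('12s0324.pdf', AREAS.values()) should then give one DataFrame per area, though I have only run the single-area call shown above.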