I want to extract table data from a PDF, and I am using the command below to do it:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -t example.pdf
With this command, the data from two columns gets mixed together in some rows. I would like to specify column coordinates so the columns are split correctly, but I don't know how to work out those coordinates. Could anyone show me the right command?
Thanks in advance!
You can specify the column coordinates using the -c or --columns parameter. The values you pass are the x coordinates of the boundaries between columns, not of the columns themselves. So if one column goes from 10.5 to 13.5 and the next column goes from 13.5 to 17.5, you list only 13.5. You also need guessing to be off. As the help text shows, -g/--guess is a plain switch that takes no value, so turning it off simply means leaving it out of the command. You didn't provide an example PDF, so I can't give you the correct coordinates, but your command would look something like this:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c 15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7 -t example.pdf
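If you can read each column's left and right x edge off the page (for example with a PDF viewer's measuring tool), the separators for -c follow directly: take the point midway between one column's right edge and the next column's left edge. Here is a minimal shell sketch of that arithmetic, assuming the edges are measured in the same units as the -a values. column_separators is a made-up helper, not part of tabula-java, and the edge values are only the illustrative ones used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of tabula-java): turn measured column edges
# into the separator list expected by -c/--columns.
# Each argument is one column as "left:right", listed from left to right.
column_separators() {
  printf '%s\n' "$@" | awk -F: '
    # From the second column on, the separator is the midpoint between
    # the previous column right edge and the current column left edge.
    NR > 1 { sep = (prev_right + $1) / 2
             out = (out == "" ? "" : out ",") sep }
    { prev_right = $2 }
    END { print out }'
}

# Two columns spanning 10.5-13.5 and 13.5-17.5 share a single boundary.
cols=$(column_separators 10.5:13.5 13.5:17.5)

# Print the command instead of running it; example.pdf is the placeholder
# file name from the question.
echo "java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c $cols -t example.pdf"
```

Where two columns touch, the midpoint is just the shared boundary. Where there is a gap between them, it falls in the middle of the white space, which is usually the safest place to split.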
You can read more about the different options for getting your command just right from the help command:
$ java -jar target/tabula-1.0.1-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
       <FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
       [-s <PASSWORD>] [-t] [-u] [-v]
Tabula helps you extract tables from PDFs
 -a,--area <AREA>           Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire
                            page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
 -d,--debug                 Print detected table areas instead of
                            processing.
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.