How to specify the column coordinates in tabula command line


I want to extract table data from a PDF, and I am using the command below to do it:

java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -t example.pdf

With this command, the data from two columns gets mixed together in some rows. I want to specify column coordinates to get clean output, but I don't know how to pass them. Can anyone show me the correct command?

Thanks in advance!


Solution

  • You can specify the column coordinates with the -c (--columns) option. The values are the x-coordinates of the boundaries between columns, not of the columns themselves. So if one column spans 10.5 to 13.5 and the next spans 13.5 to 17.5, you list only 13.5. Guessing also needs to stay off; per the help text below, -g is a flag that takes no value, so leave it out rather than passing -g False. You didn't provide an example PDF, so I can't give you the correct coordinates, but your command would look something like this:

    java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c 15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7 -t example.pdf
    

    The help output lists the other options you can use to tune the command:

    $ java -jar target/tabula-1.0.1-jar-with-dependencies.jar --help
    usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
           <FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
           [-s <PASSWORD>] [-t] [-u] [-v]
    
    Tabula helps you extract tables from PDFs
    
     -a,--area <AREA>           Portion of the page to analyze
                                (top,left,bottom,right). Example: --area
                                269.875,12.75,790.5,561. Default is entire
                                page
     -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
     -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                                --columns 10.1,20.2,30.3
     -d,--debug                 Print detected table areas instead of
                                processing.
     -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
     -g,--guess                 Guess the portion of the page to analyze per
                                page.
     -h,--help                  Print this help text.
     -i,--silent                Suppress all stderr output.
     -l,--lattice               Force PDF to be extracted using lattice-mode
                                extraction (if there are ruling lines
                                separating each cell, as in a PDF of an Excel
                                spreadsheet)
     -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                                not to be extracted using spreadsheet-style
                                extraction (if there are no ruling lines
                                separating each cell)
     -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                                Default: -
     -p,--pages <PAGES>         Comma separated list of ranges, or all.
                                Examples: --pages 1-3,5-7, --pages 3 or
                                --pages all. Default is --pages 1
     -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                                PDF to be extracted using spreadsheet-style
                                extraction (if there are ruling lines
                                separating each cell, as in a PDF of an Excel
                                spreadsheet)
     -s,--password <PASSWORD>   Password to decrypt document. Default is empty
     -t,--stream                Force PDF to be extracted using stream-mode
                                extraction (if there are no ruling lines
                                separating each cell)
     -u,--use-line-returns      Use embedded line returns in cells. (Only in
                                spreadsheet mode.)
     -v,--version               Print version and exit.
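
To make the boundary rule from the answer concrete, here is a minimal Java sketch that turns a list of column extents into the value passed after -c. The class name `ColumnBoundaries`, its two helper methods, and the sample extents are hypothetical and not part of tabula. Taking the midpoint when two columns do not touch is an assumption of this sketch, not documented tabula behaviour. You would still need to measure the real extents in your own PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper, not part of tabula-java.
class ColumnBoundaries {

    /**
     * Returns one separator per pair of adjacent columns.
     * extents[i] = {left, right} x-coordinates of column i, ordered left to right.
     * If two columns touch, the midpoint equals the shared edge; if there is a
     * gap, the midpoint of the gap is used (an assumption of this sketch).
     */
    static List<Double> boundaries(double[][] extents) {
        List<Double> result = new ArrayList<>();
        for (int i = 0; i + 1 < extents.length; i++) {
            double rightEdge = extents[i][1];
            double nextLeft = extents[i + 1][0];
            result.add((rightEdge + nextLeft) / 2.0);
        }
        return result;
    }

    /** Formats the separators as a comma-separated list for the -c option. */
    static String columnsArgument(List<Double> boundaries) {
        StringJoiner joiner = new StringJoiner(",");
        for (double b : boundaries) {
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            joiner.add(String.format(Locale.ROOT, "%.2f", b));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Sample extents: the first two columns touch, the third follows a gap.
        double[][] extents = { {10.5, 13.5}, {13.5, 17.5}, {18.5, 30.0} };
        String columns = columnsArgument(boundaries(extents));
        // Prints a command line you could adapt; example.pdf is a placeholder.
        System.out.println("java -jar tabula-java.jar -t -c " + columns + " example.pdf");
    }
}
```

Three columns produce two separators, so the first and last edges of the table never appear in the -c list.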