
Tabula: FileNotFoundError: [Errno 2] (but file path is correct)


Problem:

import tabula as tb
import pandas as pd

other = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
dfs = tb.read_pdf(other, stream=True) #this works

file="D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
tables = tb.read_pdf(file, pages = "all", multiple_tables = True)
tables

output:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-29-c598474e8fa3> in <module>
      6 
      7 file="D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
----> 8 tables = tb.read_pdf(file, pages = "all", multiple_tables = True)
      9 tables

~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
    312 
    313     if not os.path.exists(path):
--> 314         raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
    315 
    316     if os.path.getsize(path) == 0:

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Favorites\x01. Programming\\Projects\\cell penetrating peptide supplemental.pdf'

It seems like everyone else who had this issue didn't get it resolved.

The first advice I followed was to check that the file actually exists.

import os

file=r"D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"

print(os.path.isfile(file))
print(os.path.exists(file))
print(os.path.getsize(file) == 0)

output:

True
True
False

Why is read_pdf raising an error that it should only raise if os.path.exists(file) is False?

I tried a file from the internet and it worked perfectly. The file I'm trying to read doesn't have a URL: I can't view it in my browser, I can only download it. Otherwise I'd just feed its URL into the function.

UPDATE: I tried the suggested solution

import tabula as tb
import pandas as pd


tables = tb.read_pdf(r"D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf", pages = "all", multiple_tables = True)
tables

and got this:

Got stderr: Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font DCUQIG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font DCUQIG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font DREOWG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font DREOWG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UCENHU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font UCENHU+CambriaMath

Solution

  • tabula-py's read_pdf calls a localize_file function, which in turn calls os.path.expanduser to expand the path. On Unix-like systems, for example, "~" is an alias for the user's home directory, so os.path.expanduser performs the following expansion on macOS:

    >>> os.path.expanduser("~/Documents")
    '/Users/username/Documents'
    

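    However, neither expanduser nor the os.fspath call inside it is what changes your path. In a normal (non-raw) string literal, Python itself treats \ as the start of an escape sequence when it parses the source, so "\125" (an octal escape) is already the single character 'U' before any function receives it: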

    >>> os.path.expanduser("\125")
    'U'
    >>> os.fspath("\125")
    'U'
    

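    In your case, \1 in the path was interpreted as an escape sequence and turned into the character \x01, so Windows cannot find such a directory. This is also why your os.path.exists check returned True: that test used an r-prefixed string, while the failing read_pdf call did not. To keep your path unchanged, pass it as a raw string, i.e. put an r before the opening quote, like this: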

    >>> os.path.expanduser(r"\125")
    '\\125'
    

    references:

    tabula's read_pdf line 311 localize_file is invoked

    tabula's localize_file line 72 os.path.expanduser is invoked

    Python's expanduser line 293 fspath is invoked

    a reference to ANSI escape sequences
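
The effect described in the answer can be reproduced without tabula or an actual PDF. The sketch below uses a shortened, made-up fragment of the path (the variable names are illustrative only) to show that the non-raw literal already contains \x01 before any function is called, that os.path.expanduser leaves both strings untouched, and that doubled backslashes or forward slashes are alternative ways to spell the same Windows path.

```python
import os
from pathlib import PureWindowsPath

# Non-raw literal: "\1" is an octal escape, so Python stores the control
# character \x01 here while parsing the source.
plain = "Favorites\1. Programming"

# Raw literal: the backslash and the digit 1 are kept as two characters.
raw = r"Favorites\1. Programming"

print(repr(plain))  # 'Favorites\x01. Programming'
print(repr(raw))    # 'Favorites\\1. Programming'

# expanduser does not alter either string (neither starts with "~").
assert os.path.expanduser(plain) == plain
assert os.path.expanduser(raw) == raw

# Doubling the backslash gives the same string as the raw literal.
escaped = "Favorites\\1. Programming"
assert escaped == raw

# Forward slashes avoid the problem entirely; PureWindowsPath treats both
# separators as equivalent and works on any operating system.
forward = PureWindowsPath("D:/Favorites/1. Programming")
backward = PureWindowsPath(r"D:\Favorites\1. Programming")
assert forward == backward
print(forward)  # D:\Favorites\1. Programming
```

Printing repr(file) before calling read_pdf is a quick way to check whether a path literal was changed by an escape sequence.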