I am using PyPDF2 to read a PDF file with some line drawings, with code like this:
    from PyPDF2 import PdfFileReader

    with open('temp.pdf', 'rb') as f:
        pdf = PdfFileReader(f)
        for page in pdf.pages:
            print(page['/Contents'].getData())
I see page content that looks like this:
q 0.24 0 0 0.24 0 0 cm
/R7 gs
8.5 w
1 J
1 j
0 0 0 RG
2361 118.961 m
2361 3388.96 l
S
2361 3388.96 m
118 3388.96 l
S
...
To me this looks like PostScript, using aliases for the operators (please correct me if I'm wrong).
Some of these aliases I believe I can decipher, e.g. m, l, and S look to me like newpath moveto, lineto, and stroke, respectively. However, it would be a great help if I could have a look at the alias definitions (bind def) which I assume must be present somewhere at the start of the file.
I guess this should not be difficult if you know how, but I have not been able to find out how to access this PostScript header information using PyPDF2 (despite reading the docs and searching the web, including StackOverflow).
Could someone tell me? Or am I on the wrong track entirely?
That doesn't look like PostScript to me, it looks like PDF. Since you are reading a PDF file that's hardly surprising! :-)
Since it's not PostScript, it won't have a prolog with definitions of the procedures.
You can find the operator definitions in the PDF Reference Manual, which can be found with a Google search. Don't read the ISO specification (which you shouldn't be able to get anyway, since it's copyrighted and has to be paid for); read the Adobe specification instead, it's easier.
FWIW, q is gsave (and Q is grestore), while cm concatenates a matrix onto the current transformation matrix (i.e. concat). RG is setrgbcolor for the stroking color. w, J and j set the graphics state entries for line width, line cap and line join respectively, and gs sets an extended graphics state, which doesn't really have a PostScript equivalent.
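To make the mapping concrete, here is a minimal sketch that annotates a content stream like the one in the question with rough PostScript analogues. The names PDF_OPERATORS, annotate and SAMPLE are made up for this illustration and are not part of PyPDF2. The table covers only the operators discussed here, and splitting on whitespace is an assumption that holds for simple drawing streams like this one but not for streams containing strings, arrays or inline images.

```python
# Hypothetical lookup table (not part of PyPDF2): PDF content-stream
# operator -> rough PostScript analogue. Only the operators discussed
# above are listed; the PDF Reference has the full set.
PDF_OPERATORS = {
    "q": "gsave",
    "Q": "grestore",
    "cm": "concat",
    "gs": "set extended graphics state (no PostScript equivalent)",
    "w": "setlinewidth",
    "J": "setlinecap",
    "j": "setlinejoin",
    "RG": "setrgbcolor (stroking color)",
    "m": "moveto",
    "l": "lineto",
    "S": "stroke",
}


def annotate(content):
    """Return a list of (operands, operator, meaning) tuples.

    PDF content streams are postfix: operands come first, then the
    operator. This naive version splits on whitespace and treats any
    token that is not in PDF_OPERATORS as an operand.
    """
    result = []
    operands = []
    for token in content.split():
        if token in PDF_OPERATORS:
            result.append((operands, token, PDF_OPERATORS[token]))
            operands = []
        else:
            operands.append(token)
    return result


# The first lines of the content stream shown in the question.
SAMPLE = """q 0.24 0 0 0.24 0 0 cm
/R7 gs
8.5 w
1 J
1 j
0 0 0 RG
2361 118.961 m
2361 3388.96 l
S"""

for operands, operator, meaning in annotate(SAMPLE):
    print(" ".join(operands), operator, "->", meaning)
```

For real files a proper tokenizer is the safer choice, since operators missing from the table would be silently treated as operands here.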