I'm trying to analyze texts from movie scripts and need a way to grab the specific character lines. Character lines are easily visible because they are always centered and formatted like a block quote. Here's an example.
So I would want to get characters' blocks of lines. However, when I read the pdf with something like pdfplumber, it doesn't specify that there was any difference in formatting there, so it will print out something like:
--
CLEMENTINE
God, yes. You've saved my life! Brrr!
The waitress pours the coffee.
WAITRESS
You know what you want?
--
I don't want the "The waitress pours the coffee." line to be clumped in with the character's actual speaking lines. Is there any way (using pdfplumber or any other module) that I could extract that centering or changed margin somehow? I don't really know how else to specify that this text is different. It's easy to eyeball, but the program isn't picking up the difference.
Thanks!
Unfortunately, in PDF compilations you can throw every human concept out of the pram.
ALL text is generally treated as equal, though some can be more equal than others.
So there is no such thing as tabs or centering, since every line is trivially centered between its own start and end.
So how many of those justified lines are also centered?
However, there is no flag for justification, or for left or right alignment. Those terms are meaningless to a printer: it just blobs out big letters, small letters, or letters that may look like ALL CAPS, and printing has no need for words. Literally, a PDF is just "go here, go there, and put some characters or marks on the page".
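Since placement is the only thing the file records, placement is the one thing you can get back out. Here is a minimal sketch, assuming pdfplumber's `extract_words()`, which returns one dict per word with `text`, `x0` and `top` keys. The helper names, the 3-point line tolerance and the idea of calling it on page 4 are my own assumptions for illustration.

```python
def lines_with_left_edge(words, y_tolerance=3):
    """Group word dicts into lines and return (left_x0, text) per line, top to bottom."""
    lines = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # A word joins the current line if its top is close to that line's first word.
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    result = []
    for line in lines:
        line.sort(key=lambda w: w["x0"])  # left to right within the line
        result.append((line[0]["x0"], " ".join(w["text"] for w in line)))
    return result


def read_page(path, page_number):
    """Hypothetical wrapper: left edge and text for every line on one page."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a 0-indexed list
        return lines_with_left_edge(page.extract_words())
```

Printing the result of `read_page("script.pdf", 4)` (file name hypothetical) should show whether the left edges fall into a few distinct values, which is the only "formatting" there is to find.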
If we load the URL for page 4 into a PDF editor we can see how it was constructed.
So it is unusual that the text is only ragged right (just like it would be from a line printer or typewriter); I had expected ragged left too. However, in either case nothing stored in the file differentiates one text line from another. The typewriter face is all one height, so only human intelligence can say what is dialogue and what is a stage direction.
So you ask how to tell the difference, and the answer is clear: luckily, unlike most PDFs, this one has a semblance of indentation (very rare). It was built using Microsoft Word but follows stagecraft conventions: "Professional screenwriting software takes care of this by automatically tabbing down to a new line in dialogue. There may be small discrepancies between them but nothing to get too hung up on."
approaches her with a coffee pot.
CLEMENTINE
Hi, it's me again! My home away from...
It may vary from document to document, but in the PDF copy linked above:
6 spaces is a stage direction or slugline
27 spaces is a CHARACTER
17 spaces is "Dialogue", but without any quote marks
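The leading-space counts listed above can be turned into a simple classifier. This is only a sketch: it assumes you already have the page as text lines with their indentation preserved, and the label names, the 2-space tolerance and the function names are my own choices. The widths are the ones quoted for this PDF and will differ for other scripts.

```python
# Leading-space counts quoted above for this particular PDF (assumption: other files differ).
INDENTS = {6: "direction", 27: "character", 17: "dialogue"}


def classify(line, tolerance=2):
    """Label one indentation-preserved text line by its leading-space count."""
    stripped = line.lstrip(" ")
    if not stripped.strip():
        return "blank"
    indent = len(line) - len(stripped)
    nearest = min(INDENTS, key=lambda width: abs(width - indent))
    if abs(nearest - indent) > tolerance:
        return "unknown"  # e.g. parentheticals or dual dialogue
    return INDENTS[nearest]


def dialogue_blocks(lines):
    """Collect (character, speech) pairs, leaving stage directions out."""
    blocks, speaker, speech = [], None, []

    def flush():
        if speaker and speech:
            blocks.append((speaker, " ".join(speech)))

    for line in lines:
        kind = classify(line)
        if kind == "character":
            flush()
            speaker, speech = line.strip(), []
        elif kind == "dialogue" and speaker:
            speech.append(line.strip())
        elif kind == "direction":
            flush()  # a stage direction or slugline ends the current speech
            speaker, speech = None, []
    flush()
    return blocks
```

Blank and unrecognised lines are skipped instead of guessed at, so anything at an unexpected indent simply drops out and can be inspected separately.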
However, flies always turn up in the ointment, and here there are two characters, so the character names are moved and dialogue starts where a stage direction (Mixed Case) or slugline (ALL CAPS) would be expected.
She scrambles in the window. Joel looks around, panicked.
^ JOEL VOICE-OVER
^ (whisper) I couldn't believe you did
Clementine. that. I was paralyzed with
^ fear.
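One hedged way to at least notice lines like these, assuming the extracted text keeps its spacing: split each line wherever there is a wide run of spaces and keep the column where each piece starts. The function name and the 3-space gap are assumptions of mine, and this only flags side-by-side text; it does not decide which speaker owns which column.

```python
import re

# Words separated by one or two spaces belong together; a gap of 3+ spaces starts a new column.
_COLUMN = re.compile(r"\S+(?: {1,2}\S+)*")


def split_columns(line):
    """Return (start_column, text) for each run of text separated by wide gaps."""
    return [(m.start(), m.group()) for m in _COLUMN.finditer(line)]


def looks_side_by_side(line):
    """True when a single line holds more than one column of text."""
    return len(split_columns(line)) > 1
```

Lines that come back with two columns can be set aside for manual checking instead of being fed to an indentation-based rule that would mislabel them.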