I have a movie script. My first job is to collect each character's lines in a dictionary.
Later I will need to put the data into a series.
Right now, I have all of the dialogue in a list, starting with the character names. It is formatted like this:
Dialogue[0] 'NAME1\n(16 whitespaces)YO, YO, good that you're here man.'
All of the names end with \n, and every line of dialogue starts with 16 spaces. I think this could be useful, but I'm not sure how to make use of it.
I've tried a number of things but pretty much no luck.
    result = {}
    for lines in dialogue:
        first_token = para.split()[0]
        if first_token.endswith('\n'): #this would be the name
            name, line = para.split(on the new line?)
            name = name.strip()
            if name not in result:
                result[name] = []
            result[name].append(line)
    return result
This code gives me a whole load of errors, so I don't think it's useful to list them here.
Ideally I need each character as the first key in the dictionary and then all of their lines as the data.
Something like this:
Name1:[Line1, Line2, Line3...] Name2:[Line1, Line2, Line3...]
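For reference, here is a minimal sketch of one way to get that shape from the list, assuming every entry really is 'NAME\n' followed by indented dialogue. The function name collect_lines and the sample entries are hypothetical. Splitting at the first newline, rather than on whitespace, keeps names with more than one word intact.

```python
from collections import defaultdict


def collect_lines(dialogue):
    """Group dialogue text by character name.

    Assumes each entry looks like 'NAME\\n<16 spaces>text'.
    """
    result = defaultdict(list)
    for entry in dialogue:
        # Split at the first newline only: name before it, spoken text after it.
        name, sep, text = entry.partition("\n")
        if not sep:
            continue  # no newline found, so there is no name/dialogue pair
        # Collapse the indentation and any wrapped lines into single spaces.
        result[name.strip()].append(" ".join(text.split()))
    return dict(result)


# Hypothetical sample data in the format described in the question
dialogue = [
    "NAME1\n" + " " * 16 + "YO, YO, good that you're here man.",
    "NAME 2\n" + " " * 16 + "I don't think that's good!",
    "NAME1\n" + " " * 16 + "We were at the club.",
]
lines_by_name = collect_lines(dialogue)
```

Note that split() with no arguments discards the trailing '\n', which is why checking first_token.endswith('\n') can never succeed.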
EDIT: Some of the character names have two words
EDIT 2: Maybe it would be easier to go back to the original movie script text file.
It is formatted like this:
NAME1
Yo, Yo, good that you're here
man.
NAME2
(Laughing)
I don't think that's good! We were
at the club, smoking, laughing -- doing
stuff.
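If you do go back to the raw text, a rough sketch might look like the following. It rests on a loud assumption: any non-empty line written entirely in capitals is a character name, and everything up to the next such line is that character's speech. A shouted line of dialogue such as "NO!" would be misread as a name, so a real script may need a stricter test (for example, on indentation). The function name parse_script is hypothetical.

```python
def parse_script(text):
    """Map each character name to a list of speeches.

    Assumption: an all-capitals line is a character name; the lines that
    follow, up to the next name, form one speech by that character.
    """
    speeches = {}   # name -> list of speeches, each a list of text lines
    current = None  # speech list of the character currently talking
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.isupper():
            # New name: start a fresh, empty speech for this character.
            current = speeches.setdefault(line, [])
            current.append([])
        elif current is not None:
            current[-1].append(line)
    # Join wrapped lines into one string per speech, skipping empty speeches.
    return {
        name: [" ".join(parts) for parts in blocks if parts]
        for name, blocks in speeches.items()
    }


script = """NAME1
Yo, Yo, good that you're here
man.
NAME2
(Laughing)
I don't think that's good! We were
at the club, smoking, laughing -- doing
stuff.
"""
by_name = parse_script(script)
```

Stage directions such as (Laughing) are kept as part of the speech here; they could be filtered out by skipping lines that start with "(" and end with ")".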
EDIT: added spaces in name regex, strip name whitespace
    import re

    pad = " " * 16  # the 16 spaces that precede every line of dialogue
    lines = [
        "Dialogue[0] 'NAME1 \n" + pad + "YO, YO, good that you're here man.'",
        "Dialogue[1] 'NAME 1\n" + pad + "YO, YO, ",
        "Dialogue[2] 'NAME2\n" + pad + "YO, YO, good that ",
        "Dialogue[3] 'NAME2\n" + pad + "YO, YO, good that you're here'",
    ]
    # Name: capitals, digits and spaces; then a newline, 16 spaces and the dialogue
    regex = re.compile("'([A-Z 0-9]+)\n {16}(.+)")

    result = {}
    for line in lines:
        match = regex.search(line)
        if match:  # skip entries that do not fit the pattern
            name, text = match.groups()
            result.setdefault(name.strip(), []).append(text)
    result
Output:
{'NAME 1': ['YO, YO, '],
'NAME1': ["YO, YO, good that you're here man.'"],
'NAME2': ['YO, YO, good that ', "YO, YO, good that you're here'"]}