
Add character names and their lines to a new dictionary from array / list


I have a movie script. My first job is to collect each character's lines in a dictionary.

Later I will need to put the data into a series.

Right now, I have all of the dialogue in a list, starting with the character names. It is formatted like this:

Dialogue[0] 'NAME1\n(16 whitespaces)YO, YO, good that you're here man.'

Every name ends with \n, and every line of dialogue starts with 16 spaces. I think this could be useful, but I'm not sure how to make use of it.

I've tried a number of things but pretty much no luck.

    result = {}
    for lines in dialogue:
        first_token = para.split()[0]
        if first_token.endswith('\n'): #this would be the name
            name, line = para.split(on the new line?)
            name = name.strip()
            if name not in result:
                result[name] = []
            result[name].append(line)
    return result

This code gives me a whole load of errors, so I don't think it's useful to list them here.

Ideally, each character's name would be a key in the dictionary, with a list of all of that character's lines as the value.

Something like this:

Name1:[Line1, Line2, Line3...] Name2:[Line1, Line2, Line3...]

EDIT: Some of the character names have two words

EDIT 2: Maybe it would be easier to go back to the original movie script text file.

It is formatted like this:

          NAME1
Yo, Yo, good that you're here
man.

          NAME2
     (Laughing)
I don't think that's good!  We were
at the club, smoking, laughing -- doing
stuff.

Solution

• split each entry into the name and the spoken line
• create a dict with one key per unique actor name
• append each actor's lines to the list under that key
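The steps above can be sketched without a regular expression, assuming every list entry has the shape described in the question: a name, a newline, then the indented line of dialogue. The function name `group_lines_by_name` and the sample entries are made up for illustration; only `str.partition` and `dict.setdefault` from the standard library are used.

```python
def group_lines_by_name(dialogue):
    """Map each character name to the list of lines that character speaks."""
    result = {}
    for entry in dialogue:
        # partition splits at the first newline only: name before, speech after
        name, newline, speech = entry.partition("\n")
        if not newline:
            continue  # no newline means no "NAME\nline" structure; skip it
        # strip() removes the 16-space indent and any stray whitespace
        result.setdefault(name.strip(), []).append(speech.strip())
    return result


# Hypothetical sample data in the shape described in the question
dialogue = [
    "NAME1\n                YO, YO, good that you're here man.",
    "NAME TWO\n                I don't think that's good!",
    "NAME1\n                We were at the club.",
]
lines_by_name = group_lines_by_name(dialogue)
print(lines_by_name)
```

Because the name is everything before the first newline, two-word names need no special handling here.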

EDIT: the name regex now allows spaces, and whitespace is stripped from each name.

    import re
    from pprint import pprint

    # Sample data: a name, a newline, then 16 spaces before the spoken line.
    lines = [
        "Dialogue[0] 'NAME1 \n                YO, YO, good that you're here man.'",
        "Dialogue[1] 'NAME 1\n                YO, YO, ",
        "Dialogue[2] 'NAME2\n                YO, YO, good that ",
        "Dialogue[3] 'NAME2\n                YO, YO, good that you're here'",
    ]

    # Group 1: the name (capitals, digits, spaces); group 2: the text after 16 spaces.
    regex = re.compile(r"'([A-Z 0-9]+)\n {16}(.+)")

    result = {}
    for line in lines:
        match = regex.search(line)
        if match is None:
            continue  # skip entries that do not look like "NAME\n<16 spaces>text"
        name, text = match.groups()
        # setdefault creates the list the first time a name is seen
        result.setdefault(name.strip(), []).append(text)

    pprint(result)  # pprint sorts the keys before printing
    

Output:

    {'NAME 1': ['YO, YO, '],
     'NAME1': ["YO, YO, good that you're here man.'"],
     'NAME2': ['YO, YO, good that ', "YO, YO, good that you're here'"]}
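For the raw script layout shown in EDIT 2 of the question, a line-by-line pass might be simpler than a regex. The sketch below is only a rough starting point and rests on assumptions taken from that small sample: a speech block opens with an indented, upper-case name, stage directions sit alone in parentheses, and a blank line ends the speech. A real screenplay may break these rules (scene headings, for instance). The function name `parse_script` is made up for illustration.

```python
def parse_script(text):
    """Collect each character's speeches from raw screenplay text."""
    result = {}
    name, speech = None, []
    # The extra blank line at the end flushes the final speech
    for raw in text.splitlines() + [""]:
        stripped = raw.strip()
        if not stripped:
            # Blank line: the current speech is over, so store it
            if name is not None and speech:
                result.setdefault(name, []).append(" ".join(speech))
            name, speech = None, []
        elif name is None:
            # Assumed: a speech block opens with an indented, upper-case name
            if raw[0].isspace() and stripped.isupper():
                name = stripped
        elif stripped.startswith("(") and stripped.endswith(")"):
            continue  # stage direction such as "(Laughing)"; skip it
        else:
            # split() also collapses runs of spaces inside the line
            speech.extend(stripped.split())
    return result


# Sample copied from the layout shown in the question
script = """
          NAME1
Yo, Yo, good that you're here
man.

          NAME2
     (Laughing)
I don't think that's good!  We were
at the club, smoking, laughing -- doing
stuff.
"""
parsed = parse_script(script)
print(parsed)
```

Each speech is stored as one string with its wrapped lines joined by single spaces, so a character who speaks several times gets one list item per speech.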