python-pptx: Getting odd splits when extracting text from slides

I'm using the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.

from pptx import Presentation

prs = Presentation(path_to_presentation)

# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)

It seems to be working fine, except that I'm getting odd splits in some of the text_runs. Text that I'd expect to be grouped together is split up, with no obvious pattern that I can detect. For example, sometimes the slide title is split into two parts, and sometimes it isn't.

I've discovered that I can eliminate the odd splits by retyping the text on the slide, but that doesn't scale.

I can't, or at least don't want to, merge the two parts of the split text together, because sometimes the second part of the text has been merged with a different text run. For example, on the slide deck's title slide, the title will be split in two, with the second part of the title merged with the title slide's subtitle text.

Any suggestions on how to eliminate the odd / unwanted splits? Or is this behavior more-or-less to be expected when reading text from a PowerPoint?

Solution

I'd say it's definitely to be expected. PowerPoint will split runs whenever it pleases, perhaps to highlight a misspelled word or just if you pause in typing or go in to fix a typo or something.

The only thing that can be said for sure about a run is that all the characters it contains share the same character formatting. There's no guarantee, for example, that the run is what one might call "greedy", including as many characters as possible that do share the same character formatting.
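If all you need is the text and the formatting boundaries don't matter to you, one way to sidestep the splits is to collect text per paragraph rather than per run. Here's a minimal sketch, assuming the same object structure as the quickstart loop. The function name paragraph_texts and the SimpleNamespace stand-ins are made up for illustration; with a real deck you'd pass in the Presentation object instead.

```python
from types import SimpleNamespace


def paragraph_texts(prs):
    """Return one string per non-empty paragraph, joining its runs."""
    texts = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                # concatenating the runs hides wherever PowerPoint split them
                text = "".join(run.text for run in paragraph.runs)
                if text:
                    texts.append(text)
    return texts


# Stand-in objects that mimic the attributes used above (not real pptx objects)
def _para(*parts):
    return SimpleNamespace(runs=[SimpleNamespace(text=p) for p in parts])


title = SimpleNamespace(
    has_text_frame=True,
    text_frame=SimpleNamespace(paragraphs=[_para("Quarterly ", "Re", "port")]),
)
picture = SimpleNamespace(has_text_frame=False)
demo_prs = SimpleNamespace(slides=[SimpleNamespace(shapes=[title, picture])])

print(paragraph_texts(demo_prs))  # ['Quarterly Report']
```

python-pptx also exposes paragraph.text, which should give much the same result without the explicit join.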

If you want to reconstruct that "greedy" coherence in the runs, it will be up to you, perhaps with an algorithm like this:

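last_run = None
for run in paragraph.runs:
    if last_run is not None and has_same_formatting(run, last_run):
        # fold this run's text into the previous run and drop it from the
        # paragraph; last_run stays put so following runs can join it too
        combine_runs(last_run, run)
        continue
    last_run = run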

That leaves you to implement has_same_formatting() and combine_runs(). There's a certain advantage here, because runs can contain differences you don't care about, like a dirty attribute or whatever, and you can pick and choose which ones matter to you.

A start of an implementation of has_same_formatting() would be:

def has_same_formatting(run, run_2):
    font, font_2 = run.font, run_2.font
    if font.bold != font_2.bold:
        return False
    if font.italic != font_2.italic:
        return False
    # ---same with color, size, type-face, whatever you want---
    return True
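
One way to extend that to more properties is to list the Font attributes you care about and compare them in a loop. This is only a sketch: the name COMPARED_ATTRS is made up, and the attributes listed (bold, italic, underline, size, name) are ones python-pptx's Font exposes. Color is left out because it takes more care to compare. Note that these properties are often None, meaning "inherited", so two runs that both report None compare as equal here.

```python
from types import SimpleNamespace

# Hypothetical choice of Font attributes to compare; adjust to taste
COMPARED_ATTRS = ("bold", "italic", "underline", "size", "name")


def has_same_formatting(run, run_2, attrs=COMPARED_ATTRS):
    """True if both runs agree on every attribute named in attrs."""
    font, font_2 = run.font, run_2.font
    return all(getattr(font, a) == getattr(font_2, a) for a in attrs)


# Stand-ins for runs, for demonstration only (not real pptx objects)
def _run(text, **overrides):
    props = dict.fromkeys(COMPARED_ATTRS)  # every attribute None ("inherited")
    props.update(overrides)
    return SimpleNamespace(text=text, font=SimpleNamespace(**props))


plain_a, plain_b = _run("Quarterly "), _run("Report")
bold = _run("!", bold=True)

print(has_same_formatting(plain_a, plain_b))  # True
print(has_same_formatting(plain_a, bold))     # False
```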

combine_runs(base, suffix) would look something like this:

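def combine_runs(base, suffix):
    base.text = base.text + suffix.text
    # _r is the run's underlying lxml element (a private attribute);
    # removing it from its parent deletes the suffix run from the paragraph
    r_to_remove = suffix._r
    r_to_remove.getparent().remove(r_to_remove)
    return base  # the surviving run, so callers can keep merging into it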
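
If you only need the merged text and would rather not modify the presentation (combine_runs() reaches into the private _r element), a read-only alternative is to group adjacent runs by a formatting key. Here's a sketch using itertools.groupby. The names format_key and merged_run_texts are hypothetical, and the key compares only bold and italic, so extend it with whatever properties matter to you.

```python
from itertools import groupby
from types import SimpleNamespace


def format_key(run):
    # only the properties you care about; others (like dirty) are ignored
    return (run.font.bold, run.font.italic)


def merged_run_texts(paragraph):
    """Join the text of adjacent runs that share the same format_key."""
    return [
        "".join(run.text for run in group)
        for _, group in groupby(paragraph.runs, key=format_key)
    ]


# Stand-ins for a paragraph and its runs (not real pptx objects)
def _run(text, bold=None, italic=None):
    return SimpleNamespace(text=text, font=SimpleNamespace(bold=bold, italic=italic))


demo_paragraph = SimpleNamespace(
    runs=[_run("Quarterly "), _run("Re"), _run("port"), _run(" (draft)", italic=True)]
)

print(merged_run_texts(demo_paragraph))  # ['Quarterly Report', ' (draft)']
```

Because groupby only merges consecutive items with equal keys, runs with the same formatting that are separated by a differently formatted run stay separate, which matches the "greedy" behavior described above.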