How to extract text from PowerPoint text boxes in the order they appear in the presentation, using python-pptx

My PowerPoint slides consist of text boxes, sometimes inside group shapes. When I extract text from them, it doesn't come out in any consistent order: sometimes the text box at the end of the presentation is extracted first, sometimes one from the middle, and so on.

The following code gets text from textboxes and handles group objects too.

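from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

text_list = []
for eachfile in files:  # files: paths to the .pptx files, defined elsewhere
    prs = Presentation(eachfile)
    textrun = []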
    # ---Only on text-boxes outside group elements---
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)

        # ---Only operate on group shapes---
        group_shapes = [shp for shp in slide.shapes
                        if shp.shape_type == MSO_SHAPE_TYPE.GROUP]
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    print(shape.text)
                    textrun.append(shape.text)
    new_list=" ".join(textrun)
    text_list.append(new_list)

print(text_list)

I would like to filter some of the data extracted based on their order of appearance in the slide. On what basis does the function decide the order? What should be done to solve this problem?

Solution

Steve's comment is quite right; the shapes returned by:

for shape in slide.shapes:
    ...

are in the document order of the underlying XML, which is also what establishes z-order. Z-order is the "stacking" order, as if each shape were on a separate transparent sheet (layer), with the first returned shape on the bottom and each subsequent shape added to the top of the stack (overlapping any beneath it). The order therefore reflects how the shapes are stacked, not where they sit on the slide.
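One way to see this on your own slides is to list each shape's document index next to its position. The following is only a sketch: `describe_shapes` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the `name`, `top` and `left` attributes of python-pptx shapes so the snippet runs without a file. With a real presentation you would pass `slide.shapes` instead.

```python
from types import SimpleNamespace


def describe_shapes(shapes):
    """Return (document_index, name, top, left) for each shape, in iteration order."""
    return [(i, s.name, s.top, s.left) for i, s in enumerate(shapes)]


# Stand-ins for shapes, listed in document (z-) order. The footer comes
# first here even though it sits near the bottom of the slide.
stand_ins = [
    SimpleNamespace(name="Footer", top=6000000, left=0),
    SimpleNamespace(name="Title", top=300000, left=500000),
]

for row in describe_shapes(stand_ins):
    print(row)
```

If the indexes and the `top` values disagree, as they do here, the extraction order you are seeing is just z-order.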

I think what you're after here is something like left-to-right, top-to-bottom. You'll need to write your own code to sort the shapes in this order, using shape.left and shape.top.

Something like this might do the trick:

from pptx.enum.shapes import MSO_SHAPE_TYPE


def iter_textframed_shapes(shapes):
    """Generate shape objects in *shapes* that can contain text.

    Shape objects are generated in document order (z-order), bottom to top.
    """
    for shape in shapes:
        # ---recurse on group shapes---
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            yield from iter_textframed_shapes(shape.shapes)
            continue

        # ---otherwise, treat shape as a "leaf" shape---
        if shape.has_text_frame:
            yield shape

# ---`slide` is a single slide, e.g. from `for slide in prs.slides:`---
textable_shapes = list(iter_textframed_shapes(slide.shapes))
ordered_textable_shapes = sorted(
    textable_shapes, key=lambda shape: (shape.top, shape.left)
)

for shape in ordered_textable_shapes:
    print(shape.text)
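One caveat with sorting on `(shape.top, shape.left)` directly: two boxes that look like they are on the same row but differ slightly in `top` will be ordered by that small vertical difference rather than left to right. A possible refinement, sketched below, is to bucket shapes into rows first. `reading_order` and its `row_tolerance` default (a tenth of an inch) are my own assumptions, not part of python-pptx; the only library fact relied on is that lengths such as `top` and `left` are integers in EMU. The demo again uses `SimpleNamespace` stand-ins rather than real shapes.

```python
from types import SimpleNamespace

EMU_PER_INCH = 914400  # shape positions are integer lengths in EMU


def reading_order(shapes, row_tolerance=EMU_PER_INCH // 10):
    """Return shapes top-to-bottom, then left-to-right within each row.

    A shape joins the current row when its `top` is within `row_tolerance`
    of the first (topmost) shape in that row. Assumes every shape has
    non-None `top` and `left`.
    """
    rows = []
    for shape in sorted(shapes, key=lambda s: (s.top, s.left)):
        if rows and shape.top - rows[-1][0].top <= row_tolerance:
            rows[-1].append(shape)
        else:
            rows.append([shape])
    # Flatten, ordering each row by horizontal position.
    return [s for row in rows for s in sorted(row, key=lambda s: s.left)]


# Stand-ins: "right" is a hair higher than "left" but on the same visual row.
demo = [
    SimpleNamespace(text="right", top=100000, left=5000000),
    SimpleNamespace(text="left", top=120000, left=100000),
    SimpleNamespace(text="below", top=2000000, left=0),
]

print([s.text for s in sorted(demo, key=lambda s: (s.top, s.left))])
print([s.text for s in reading_order(demo)])
```

With real slides you would call `reading_order(iter_textframed_shapes(slide.shapes))` once per slide, and tune `row_tolerance` to the layout you are dealing with.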