Tags: python, nlp, python-docx

Python chunk list from one element to another


I've got the following code:

    for paragraph in document.paragraphs:
        while paragraph.style.name == 'Heading 2':
            print(paragraph.style.name)
            print(paragraph.text)

This doesn't work because I can't figure out the right logic. I'm using the python-docx library (https://python-docx.readthedocs.io/en/latest/user/styles-using.html) to iterate through the document's paragraphs.

I want to split the list of paragraphs into sublists. Each sublist should start at a Heading 2 paragraph and include all the following paragraphs with a different paragraph.style.name, up to (but not including) the next Heading 2, so that each chunk contains one Heading 2 paragraph with its corresponding text.

In other words, I'm looking for a way to split the list into chunks from one element to another. Please help :)


Solution

  • You could use itertools.groupby to accomplish this:

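    from itertools import groupby
    
    groups, next_group = [], []
    
    # groupby yields runs of consecutive paragraphs that share the same key,
    # so back-to-back 'Heading 2' paragraphs arrive together as one run
    for k, group in groupby(document.paragraphs, lambda x: x.style.name == 'Heading 2'):
        # If the predicate is True and next_group is populated,
        # the previous chunk is finished, so store it and start a new one
        if k and next_group:
            groups.append(next_group)
            next_group = []
    
        # Fill up the current chunk
        for paragraph in group:
            # feel free to swap this out with a print statement
            # or whatever data structure suits you
            next_group.append({'style_name': paragraph.style.name, 'text': paragraph.text})
    
    # The loop never stores the chunk it was still filling, so add it here
    if next_group:
        groups.append(next_group)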

    I'm using a list of dictionaries here for clarity, but you can substitute any data structure that suits you.
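
If two Heading 2 paragraphs can appear back to back, groupby hands them over as a single run, so a plain loop may be easier to reason about. Below is a minimal sketch of that alternative, not part of the original answer. The names chunk_by_heading and make_paragraph are hypothetical, and the SimpleNamespace objects are stand-ins that only mimic the .style.name and .text attributes of python-docx paragraphs so the example runs without a .docx file.

```python
from types import SimpleNamespace


def chunk_by_heading(paragraphs, heading_style="Heading 2"):
    """Split paragraphs into lists that each begin at a heading paragraph.

    Paragraphs that come before the first heading end up in a leading
    chunk of their own, without a heading.
    """
    chunks = []
    current = []
    for paragraph in paragraphs:
        # Every heading closes the chunk being filled, even when it
        # directly follows another heading.
        if paragraph.style.name == heading_style and current:
            chunks.append(current)
            current = []
        current.append(paragraph)
    # Store the chunk that was still being filled when the input ended.
    if current:
        chunks.append(current)
    return chunks


def make_paragraph(style_name, text):
    # Hypothetical stand-in for a python-docx paragraph object.
    return SimpleNamespace(style=SimpleNamespace(name=style_name), text=text)


sample = [
    make_paragraph("Heading 2", "Intro"),
    make_paragraph("Normal", "First body"),
    make_paragraph("Heading 2", "Empty section"),
    make_paragraph("Heading 2", "Methods"),
    make_paragraph("Normal", "Second body"),
]

chunks = chunk_by_heading(sample)
texts = [[p.text for p in chunk] for chunk in chunks]
print(texts)
```

With a real document you would pass document.paragraphs instead of sample. Each heading, including one with no body text, then gets its own chunk.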