
Cleaning up a Table of Contents to extract just the Titles using Python?


I'm working on an academic research project that requires extracting titles from a Table of Contents. I'm making a Python program to clean up text that looks like this:

BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
An act to provide for publishing a now edition of Dresses Reports ..................................... 78

BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74

to look like this:

An act providing the officers of the State of Illinois from making payments on certain bonds .

An act to provide for publishing a now edition of Dresses Reports .

An act to provide for the better protection of the public bridges in this State .

My strategy is to somehow iterate through a text file and delete characters after the first '.' and before the next 'An act'. I thought about trying a nested 'for' loop like this:

for line in file:
    for character in line:

But iterating by character makes it hard to detect a whole string (i.e. 'An act'). I'm a beginner in Python (and coding) and would greatly appreciate any help. Are there regular expressions that would help delete all the characters in a line before 'An act' and after the first period? Thank you!


Solution

  • You can use a regular expression that matches lines starting with "An act", followed by a space and at least one character, followed by a period (pasting the pattern into regex101 gives a more in-depth explanation). The non-greedy quantifier +? makes the match stop at the first period, and ?: marks a group as non-capturing, meaning we don't need its contents on their own:

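    import re

    with open("data.txt") as file:
        for line in file:
            # Anchor at the start of the line so section headings are skipped;
            # the lazy .+? stops at the first period after "An act ".
            search_result = re.search(r"^(An act (?:.+?)\.)", line)
            if search_result:
                # group(1) is the title up to and including that period.
                print(search_result.group(1))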

    This outputs:

    An act providing the officers of the State of Illinois from making payments on certain bonds .
    An act to provide for publishing a now edition of Dresses Reports .
    An act to provide for the better protection of the public bridges in this State .
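
If the whole table of contents is already in memory as one string, the same idea can be sketched with re.findall and the re.MULTILINE flag instead of a line-by-line loop. This is only a minimal sketch: the helper name extract_titles and the inline sample text are assumptions made for illustration, and the pattern is a simplified form of the one above (the non-capturing group is dropped because it does not change what is matched).

```python
import re

# Same idea as the pattern above: a line that starts with "An act ",
# then as few characters as possible up to the first period.
# re.MULTILINE lets ^ match at the start of every line in the string.
TITLE_PATTERN = re.compile(r"^An act .+?\.", re.MULTILINE)


def extract_titles(text):
    """Return every title that starts a line with 'An act' (hypothetical helper)."""
    return TITLE_PATTERN.findall(text)


# Sample input copied from the question, used here in place of a file.
sample = """BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
An act to provide for publishing a now edition of Dresses Reports ..................................... 78

BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74
"""

for title in extract_titles(sample):
    print(title)
```

One caveat with either version: because the match stops at the first period, a title that itself contains a period (an abbreviation, for example) would be cut short at that point.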