Tags: python, text, nlp

Remove all chapter headings and their titles from a .txt file in Python


I am working with a book in .txt format in Python.

I would like to remove all chapter headings and their titles. Each one is introduced by the word CHAPTER, as in the example below:

\n\n\n\nCHAPTER 2. I OBSERVE\n\n\n

Every heading has four newlines (\n\n\n\n) before the uppercase word CHAPTER, but the number of \n after the chapter title varies. So the rule I would like to apply is: whenever \n\n\n\nCHAPTER is found, delete the text up to and including the next \n.

\n\n\n\nCHAPTER 2. I OBSERVE\n\n\n -----> \n\n


Solution

  • Try this:

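    import re

    # Read the whole book into a single string.
    with open('book.txt', 'r') as f:
        text = f.read()

    # Replace each chapter heading (four newlines, "CHAPTER", the title,
    # and the newlines after it) with exactly two newlines.
    text = re.sub(r'\n{4}CHAPTER.*\n+', '\n\n', text)

    # Overwrite the original file with the cleaned text.
    with open('book.txt', 'w') as f:
        f.write(text)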
    

    It matches all sequences of:

    • 4 newlines (\n{4})
    • the text "CHAPTER"
    • followed by the rest of the heading line, i.e. the title (.*), which stops at the next newline
    • and one or more newlines after it (\n+)

    and replaces them with two newlines (\n\n).

    Note: this code overwrites the original file with the modified text. You may want to write to a different file and keep the original for other analyses. Alternatively, you can skip writing to disk and pass the cleaned text straight to your NLP library.
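
If you would rather keep the original file, something along these lines might work. It is only a sketch that reuses the same pattern as above. The function names, the file paths, and the UTF-8 encoding are assumptions made for illustration, and the sample string is invented, not taken from the book.

```python
import re
from pathlib import Path

# Same pattern as in the answer: four newlines, "CHAPTER",
# the rest of the heading line, then one or more newlines.
CHAPTER_HEADING = re.compile(r'\n{4}CHAPTER.*\n+')


def strip_chapter_headings(text: str) -> str:
    """Replace every chapter heading with two newlines."""
    return CHAPTER_HEADING.sub('\n\n', text)


def clean_book(src: str, dst: str) -> None:
    """Read src, strip the headings, and write the result to dst.

    The source file is left untouched. UTF-8 is an assumption here;
    adjust the encoding to match your file.
    """
    text = Path(src).read_text(encoding='utf-8')
    Path(dst).write_text(strip_chapter_headings(text), encoding='utf-8')


if __name__ == '__main__':
    # Invented sample text, only to show the effect of the substitution.
    sample = 'end of chapter one.\n\n\n\nCHAPTER 2. I OBSERVE\n\n\nstart of chapter two.'
    print(repr(strip_chapter_headings(sample)))
    # prints 'end of chapter one.\n\nstart of chapter two.'
```

Calling clean_book('book.txt', 'book_clean.txt') (both names hypothetical) would then leave book.txt as it is and put the cleaned text in the second file.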