I am working with a book in .txt format in Python.
I would like to remove all chapter headings, including their titles. Each one is introduced by the word CHAPTER, as in the example below:
\n\n\n\nCHAPTER 2. I OBSERVE\n\n\n
All have four newlines (\n\n\n\n) before the uppercase word CHAPTER, but the number of \n after the chapter title varies. So the condition I would like to impose is: whenever \n\n\n\nCHAPTER is found, delete the text up to and including the next \n.
\n\n\n\nCHAPTER 2. I OBSERVE\n\n\n -----> \n\n
Try this:
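import re
# Read the whole book into a single string.
with open('book.txt', 'r') as f:
    text = f.read()
# Replace each chapter heading and its trailing newlines with two newlines.
text = re.sub(r'\n{4}CHAPTER.*\n+', '\n\n', text)
# Overwrite the original file with the cleaned text.
with open('book.txt', 'w') as f:
    f.write(text)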
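It matches all sequences of four newlines (\n{4}), followed by the word CHAPTER and the rest of that line (.*), followed by one or more newlines (\n+), and replaces them with two newlines (\n\n).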
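To see what the substitution does without touching any file, here is a small sketch on an in-memory string. The sample text and the second chapter title are made up for illustration. The second pattern is not part of the answer above; it is one possible variant that follows the rule exactly as worded in the question (delete only up to and including the first newline after the title), so the two results differ when a heading is followed by a single newline.

```python
import re

# Hypothetical sample: two headings, followed by three and one newline(s).
sample = (
    "Intro.\n\n\n\nCHAPTER 2. I OBSERVE\n\n\nFirst."
    "\n\n\n\nCHAPTER 3. ANOTHER TITLE\nSecond."
)

# Pattern from the answer: heading plus ALL trailing newlines become "\n\n".
collapsed = re.sub(r'\n{4}CHAPTER.*\n+', '\n\n', sample)

# Variant (an assumption, not from the answer): remove the heading and only
# the first newline after it, leaving any further newlines in place.
literal = re.sub(r'\n{4}CHAPTER[^\n]*\n', '', sample)

print(repr(collapsed))
print(repr(literal))
```

The first form always leaves exactly one blank line where a heading was, which is usually what you want for paragraph-based processing. The second form preserves whatever spacing followed the title.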
Note: this code overwrites the original file with the modifications. You may want to write to a different file and keep the original for other analyses. Alternatively, you can skip writing to disk and pass the cleaned text directly to your NLP library.
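If you would rather keep the original, a minimal sketch along these lines writes the result to a separate file and also returns the cleaned text. The function name `strip_chapter_headings`, the output filename in the usage note, and the UTF-8 encoding are assumptions for illustration; adjust them to your book.

```python
import re
from pathlib import Path

# Same pattern as in the answer, compiled once for reuse.
CHAPTER_HEADING = re.compile(r'\n{4}CHAPTER.*\n+')

def strip_chapter_headings(src, dst):
    """Read src, remove chapter headings, write the result to dst.

    Returns the cleaned text so it can be passed on without re-reading dst.
    Assumes both files are UTF-8 encoded.
    """
    text = Path(src).read_text(encoding='utf-8')
    cleaned = CHAPTER_HEADING.sub('\n\n', text)
    Path(dst).write_text(cleaned, encoding='utf-8')
    return cleaned
```

For example, `text = strip_chapter_headings('book.txt', 'book_no_chapters.txt')` leaves book.txt unchanged and gives you the cleaned string for further processing.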