Tags: python, html, regex, beautifulsoup, startswith

How to split a txt file into multiple files excluding lines with certain content


I have a large .txt file that I want to split into multiple, smaller .txt files, so I'm left with readable paragraphs in each smaller .txt file.

However, I want to exclude certain parts of the source file from being written to the smaller files (i.e. if a line doesn't start with <p>, don't write it to a file).

Here is the code I have - which works fine, except it generates some files I don't want:

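import mmap
import re

filenumber = 0

out_file = None

with open('main.txt') as x:
    for line in x:
        if line.strip() == '<p>':
            # An opening <p> line starts a new output file
            filenumber += 1
            out_file = open('narrative%03d.txt' % filenumber, 'w')
        elif line.strip().startswith('</p>') and out_file:
            # A closing </p> line ends the current output file
            out_file.close()
            out_file = None
        elif out_file:
            # Any other line is written only while a file is open
            out_file.write(line)
# Close the last file if the input ended inside a paragraph
if out_file:
    out_file.close()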

What I would like is a way of saying: run the code, but if a line doesn't start with <p>, do nothing and continue with the rest of the code.

Any help would be greatly appreciated! Please let me know if I haven't provided enough info!

As the source file contains html tags the easiest way for me to show you the source file is to provide a link to it:

https://archive.org/stream/warandpeace030164mbp/warandpeace030164mbp_djvu.txt

View source to see the bits I don't want included.

I just want the paragraphs from the book, i.e.:

His daughter, Princess He*lene, passed be- tween the chairs, lightly holding up the folds of her dress, and the smile shone still more radiantly on her beautiful face. Pierre gazed at her with rapturous, almost frightened, eyes as she passed him.

"Very lovely," said Prince Andrew.

I don't want the beginning of the doc which includes all the html and chapter listings etc.


Solution

  • For the link you have provided, the whole of the text is contained within a single huge <pre>...</pre> block. As such you could easily extract it using BeautifulSoup.

    First fetch the HTML using something like requests, extract the text of the single <pre> element using BeautifulSoup, then split that text on double newlines and remove any empty entries:

    from bs4 import BeautifulSoup
    import requests
    
    # Download the page; the book text sits inside a single <pre> element
    html = requests.get('https://archive.org/stream/warandpeace030164mbp/warandpeace030164mbp_djvu.txt')
    soup = BeautifulSoup(html.text, "lxml")       # The "lxml" parser needs the lxml package installed
    war_and_peace = soup.pre.get_text()
    
    # Paragraphs are separated by blank lines
    paragraphs = war_and_peace.split('\n\n')
    paragraphs[:] = [p for p in paragraphs if len(p)]       # Remove empty entries
    
    print(paragraphs[671])
    

    The result would be a list of paragraphs. The script would display the following:

    His daughter, Princess He*lene, passed be- 
    tween the chairs, lightly holding up the folds 
    of her dress, and the smile shone still more 
    radiantly on her beautiful face. Pierre gazed 
    at her with rapturous, almost frightened, eyes 
    as she passed him.
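
To get from the list of paragraphs back to the separate files the question asks for, each entry can be written to its own numbered file. The following is only a minimal sketch using the standard library: the helper names split_paragraphs and write_paragraphs are made up for illustration, the narrative%03d.txt pattern is borrowed from the question, and the splitting is assumed to be on blank lines (including whitespace-only ones), which is slightly more lenient than split('\n\n').

```python
import os
import re
import tempfile


def split_paragraphs(text):
    """Split text on blank lines and drop empty or whitespace-only entries."""
    chunks = re.split(r'\n\s*\n', text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def write_paragraphs(paragraphs, out_dir, pattern='narrative%03d.txt'):
    """Write each paragraph to its own numbered file and return the paths."""
    paths = []
    for number, paragraph in enumerate(paragraphs, start=1):
        path = os.path.join(out_dir, pattern % number)
        with open(path, 'w', encoding='utf-8') as out_file:
            out_file.write(paragraph + '\n')
        paths.append(path)
    return paths


if __name__ == '__main__':
    # Stand-in text; in practice this would be the string taken from soup.pre
    sample = 'First paragraph,\nsecond line.\n\n\n"Very lovely," said Prince Andrew.\n'
    out_dir = tempfile.mkdtemp()  # Temporary directory, so nothing is overwritten
    for path in write_paragraphs(split_paragraphs(sample), out_dir):
        print(path)
```

Because only the text inside the <pre> block is passed in, the HTML header and navigation never reach the output files, so no per-line <p> check is needed. Front matter that sits inside the <pre> block itself (such as chapter listings) would still have to be skipped, for example by slicing the list before writing.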