I am helping a professor of mine with a research project that involves pulling one thousand sentences randomly from a set of 20 text files. This is all data from the Corpus of Contemporary American English, if anyone is familiar with working with that. In these text files, the data is arranged like so:
    ##4000348 I must begin by saying this : In preparation for this lecture , I read ( or in some cases reread ) a number of the writings of Sidney Hook . I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook . But instead I found myself infused with a set of ideas that were relevant to a different setting , a different occasion .

    ##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College . That was the reason news of my appointment appeared in the Wall Street Journal and the National Review , which does n't usually happen to deans of Yale College , and does n't help them much when it does .
So, there are hundreds of paragraphs, each starting with a six-digit number preceded by "##". That number identifies the source from which the sentences were drawn. I need to pull random sentences from these files, and also get the six-digit number identifying each sentence's source. So ideally, I would get something like:
    ##4000348 I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook

    ##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College .
I have succeeded in getting random sentences from the files (with some help from the kind souls here at Stack Overflow), but I don't know how to keep the number attached to them (for example, if I pull a sentence from the middle of a paragraph, how would I get the number from the start of that paragraph?). Can anyone help me think of a way to do this? This is the code I have so far, which successfully extracts sentences:
    # -*- coding: utf-8 -*-
    import re
    from random import sample

    sentences = []
    for i in range(1990, 2013):
        with open('w_acad_{}.txt'.format(i)) as f:
            sentences += re.findall(r".*?[\.\!\?]+", f.read())

    selected = sample(sentences, 2000)
    with open('out.txt', 'w') as f:
        f.write('\n'.join(selected))
In general, to avoid loading (potentially large) files into memory all at once, you could use a reservoir sampling algorithm -- just pass it an iterator that yields sentences labeled with the ##-numbers:
    #!/usr/bin/env python
    import itertools
    import re

    import nltk  # $ pip install nltk

    def paragraphs(file):
        """Yield blank-line separated paragraphs labeled with ##-numbers."""
        lines = []
        # The '' sentinel flushes the final paragraph even if the file
        # does not end with a blank line.
        for line in itertools.chain(file, ['']):
            if line.strip():
                lines.append(line)
            elif lines:  # blank line: the end of a non-empty paragraph
                paragraph = ''.join(lines)
                numbers = re.findall(r'##([0-9]+)', paragraph)  # ASCII digits only
                assert len(numbers) == 1  # exactly one ##-number per paragraph
                yield int(numbers[0]), paragraph
                del lines[:]

    def sentences(filenames):
        for filename in filenames:
            with open(filename) as file:
                for number, paragraph in paragraphs(file):
                    for sentence in nltk.sent_tokenize(paragraph):
                        yield number, sentence

    filenames = ('w_acad_%d.txt' % n for n in range(1990, 2013))
    print(reservoir_sample(sentences(filenames), 2000))
where reservoir_sample() is defined here.
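For reference, here is a minimal sketch of reservoir sampling (Algorithm R); the function name reservoir_sample and the rng parameter are my own choices, and the linked definition may differ in details:

```python
import random


def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly at random from iterable (Algorithm R).

    Only k items are held in memory at any time, so the input may be a
    one-shot generator over arbitrarily many sentences.
    """
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)  # uniform index in 0..i
            if j < k:
                reservoir[j] = item  # keep the new item with probability k/(i+1)
    return reservoir
```

If the input has fewer than k items, the whole input is returned, which is the usual convention for this algorithm.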
nltk.sent_tokenize() may be a more robust solution than the r".*?[\.\!\?]+" regular expression.
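To see why, consider what the naive pattern does with abbreviations (the sentence below is a made-up example): it treats every period as a sentence boundary, while nltk.sent_tokenize (with the punkt model downloaded) typically keeps common abbreviations like "Dr." inside the sentence.

```python
import re

text = "Dr. Hook spoke at Yale. The talk ran long."
naive = re.findall(r".*?[\.\!\?]+", text)
# The non-greedy pattern stops at every punctuation run, so "Dr." is
# emitted as its own "sentence" -- three matches for two real sentences.
print(naive)  # ['Dr.', ' Hook spoke at Yale.', ' The talk ran long.']
```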