python-3.x, text, nlp, corpus, data-preprocessing

Reading and writing a large text file in Python is too slow


This code goes over a large 5.1 GB text file and checks whether there are words that appear fewer than 100 times. It then rewrites the 5.1 GB into an output text file, replacing those words with unk. The main problem is that creating output.txt takes a very long time. I suspect the method write_text() is causing the issue because of the way it opens the dataset file and the output file.

The goal behind this script: I have a prebuilt vocab and I have a text. The text might contain new words that are not in my vocab, so I would like to add them to my vocab. But I only want to add new words that are relevant (appear more than 100 times). New words that appear in the text fewer than 100 times are disposable and not important, so I would like to change them to "unk".
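As a toy illustration of what I am after (the words and counts here are made up, not my real data), this is the decision I want for each out-of-vocab word:

from collections import Counter

# Made-up example: 'wug' is frequent enough to be added to the vocab,
# 'blicket' is rare and should be rewritten as unk.
vocab = {'the', 'cat'}
tokens = ['wug'] * 150 + ['blicket'] * 3 + ['the', 'cat']

counts = Counter(t for t in tokens if t not in vocab)
to_add = [w for w, c in counts.items() if c >= 100]  # ['wug']     -> add to vocab
to_unk = [w for w, c in counts.items() if c < 100]   # ['blicket'] -> replace with unk
print(to_add, to_unk)

My actual script is below: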


from collections import Counter

extra_words = []
new_words = []
add_words = []


def get_vocab():
    vocab = set()
    with open('vocab.txt', 'r', encoding='utf-8') as rd:
        lines = rd.readlines()

    for line in lines:
        tokens = line.split(' ')
        word = tokens[0]
        vocab.add(word)

    return vocab


def _count(text):

    vocab = get_vocab()

    with open(text, 'r', encoding='utf-8') as fd:

        for line in fd.readlines():

            for token in line.split():

                if token not in vocab:
                    extra_words.append(token)

    word_count = Counter(extra_words)

    # add del word_count[punctuation] to remove it from list

    #del word_count['"']

    for word in word_count:

        if word_count[word] < 100:
            new_words.append(word)

        else:
            add_words.append(word)

    write_text()

    #return len(new_words), word_count.most_common()[0]


def write_text():

    with open('dataset', 'r', encoding='utf-8') as fd:

        f = fd.readlines()

    with open('output.txt', 'w', encoding='utf-8') as rd:
        new_text = []
        for line in f:
            new_line = []
            for token in line.split():

                if token in new_words:

                    new_line.append('<unk>')

                else:

                    new_line.append(token)

            new_text.append(' '.join(new_line))
        print('\n'.join(new_text), file=rd)
            #print(' '.join(new_line), file=rd)


def add_vocab():

    ln = len(get_vocab())

    # append so the existing vocab entries are kept
    with open('vocab.txt', 'a', encoding='utf-8') as fd:

        for idx, word in enumerate(add_words):

            print(f'{word} {ln + idx + 1}', file=fd)

    pass


print(_count('dataset'))
add_vocab()

Solution

  • I tested this with the complete works of Shakespeare. You still have a fair amount of work ahead of you related to case and punctuation (a rough normalization sketch follows at the end of this answer). It processes 100 copies of his works (about 500 MB) in roughly 15 seconds for me; if it takes an unacceptably long time for you, you might want to profile your code. Note that I used a simplified version of your vocabulary file, as I did not follow what you wanted to see in it: the version I used is just one word per line.

    import collections
    
    def get_vocabulary(path):
        with open(path, 'r', encoding='utf-8') as file_in:
            tokens = [line.strip("\n") for line in file_in]
        return set(tokens)
    
    def get_interesting_word_counts(path, vocabulary):
        word_counts = collections.Counter()
        with open(path, 'r', encoding='utf-8') as file_in:
            for line in file_in:
                word_counts.update([token for token in line.split() if token not in vocabulary])
        return word_counts
    
    def get_cleaned_text(path, vocabulary, uncommon_words):
        with open(path, 'r', encoding='utf-8') as file_in:
            for line in file_in:
                #line_out = " ".join(["<unk>" if token in uncommon_words else token for token in line.strip("\n").split()])
                line_out = " ".join([
                    token if token in vocabulary or token not in uncommon_words else "<unk>"
                    for token in line.strip("\n").split()
                ])
                yield "{}\n".format(line_out)
    
    vocabulary = get_vocabulary("vocabulary.txt")
    word_counts = get_interesting_word_counts("shakespeare.txt", vocabulary)
    
    ## --------------------------------------
    ## Add frequent but missing words to vocabulary
    ## --------------------------------------
    common_words = set([item[0] for item in word_counts.items() if item[1] >= 100])
    with open('vocabulary.txt', 'a', encoding='utf-8') as file_out:
        for word in common_words:
            file_out.write("{}\n".format(word))
    ## --------------------------------------
    
    ## --------------------------------------
    ## Rewrite the text, censoring uncommon words
    ## --------------------------------------
    uncommon_words = set([item[0] for item in word_counts.items() if item[1] < 100])
    cleaned_text = get_cleaned_text("shakespeare.txt", vocabulary, uncommon_words)
    with open('shakespeare_out.txt', 'w', encoding='utf-8') as file_out:
        file_out.writelines(cleaned_text)
    ## --------------------------------------
    

    You can get the text I used here: http://www.gutenberg.org/ebooks/100

    The source begins:

    The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare
    

    The resulting file begins:

    <unk> <unk> <unk> <unk> of The <unk> <unk> of <unk> <unk> by <unk> <unk>
    

    The updated vocabulary file begins:

    as
    run
    he’s
    this.
    there’s
    like
    you.
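
    A rough sketch of the case/punctuation clean-up mentioned above could look like the following. The normalize() helper and the exact set of characters it strips are illustrative assumptions, not part of the code above; the idea is simply to fold tokens such as "You.", "this." and "you" into one form before the 100-occurrence check.

    import string

    # Characters to trim from the ends of a token. This set is an assumption;
    # extend or shrink it to match your corpus.
    _STRIP_CHARS = string.punctuation + "“”‘’"

    def normalize(token):
        # Lowercase and strip surrounding punctuation only, so internal
        # apostrophes (there’s, he’s) survive.
        return token.lower().strip(_STRIP_CHARS)

    print(normalize("You."))     # -> you
    print(normalize("this."))    # -> this
    print(normalize("there’s"))  # -> there’s

    Applying normalize() to each token in get_interesting_word_counts and get_cleaned_text (and to the vocabulary entries themselves) would make the counts, and therefore the <unk> decisions, case- and punctuation-insensitive.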