Tags: python, bash, memory-management

Python/Bash 'MemoryError': how can I make my script more efficient?


I have written the following script to calculate statistics about the textual corpora I analyze from a linguistics angle. However, the text files I work with are rather big for such processing (~3 GB, ~500M words), which is probably what makes my script fail on my current hardware (i5, 16 GB RAM). I get the 'MemoryError' when I launch the script from the Terminal, so I must admit that I am unsure whether it is a Python or a Bash error message, although I reckon the implications are the same; correct me if I'm wrong.

I am not a computer scientist, so it is very likely that the tools I use are not the best suited for the task. Would anyone have recommendations to improve the script and make it able to handle such volumes of data? Please keep in mind that my tech/programming knowledge is rather limited, as I am a linguist first and foremost, so if you could explain the technical details with that in mind, that would be awesome.

Thanks a lot in advance!

EDIT: here is the error message I get, as requested by some of you:

"Traceback (most recent call last): File "/path/to/my/myscript.py", line 43, in keywords, target_norm, reference_norm, smp_score = calculate_keywords('file1.txt', 'file2.txt') File "/path/to/my/myscript.py", line 9, in calculate_keywords target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')) MemoryError

#!/usr/bin/env python3

import collections
import math
import string

def calculate_keywords(target, reference):
    with open(target, 'r') as f:
        target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        target_words = target_text.split()

    with open(reference, 'r') as f:
        reference_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        reference_words = reference_text.split()

    target_freq = collections.Counter(target_words)
    reference_freq = collections.Counter(reference_words)

    target_total = sum(target_freq.values())
    reference_total = sum(reference_freq.values())
    
    target_norm = {}
    reference_norm = {}

    for word, freq in target_freq.items():
        target_norm[word] = freq / target_total * 1000000

    for word, freq in reference_freq.items():
        reference_norm[word] = freq / reference_total * 1000000

    smp_scores = {}
    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")

Solution

  • Here is a working solution that is also fast. Instead of loading each file whole with f.read(), it reads line by line and updates a Counter as it goes, so only one line plus the word counts are ever in memory; memory use scales with the vocabulary size rather than with the file size:

    #!/usr/bin/env python3
    
    import collections
    
    # Build the punctuation-stripping table once, not once per line.
    PUNCT_TABLE = str.maketrans('', '', '?!"():;.,“/[]')
    
    def words_in_line(line):
        return line.lower().translate(PUNCT_TABLE).split()
    
    def get_counter(filename):
        # Stream the file line by line so the whole text is never in memory.
        counter = collections.Counter()
        with open(filename) as file:
            for line in file:
                counter.update(words_in_line(line))
        return counter
    
    def get_norm(filename):
        # Convert raw counts to frequencies per million words.
        c = get_counter(filename)
        total = sum(c.values())
        return {word: freq / total * 1_000_000 for word, freq in c.items()}
    
    def calculate_keywords(target, reference):
        target_norm = get_norm(target)
        reference_norm = get_norm(reference)
    
        # Score each target word by the ratio of its smoothed normalized
        # frequencies in the target and reference corpora (add-100 smoothing).
        smp_scores = {}
        for word, freq in target_norm.items():
            if word not in reference_norm:
                reference_norm[word] = 0
            s1 = freq + 100
            s2 = reference_norm[word] + 100
            smp_scores[word] = s1 / s2
    
        # Keep the 50 highest-scoring words.
        keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
        return keywords, target_norm, reference_norm, smp_scores
    
    
    keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
    for word in keywords:
        print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")