regex, string, list, substring, counter

Why are string conversions slowing down my code?


I am new to Python and wrote some code that:

  • Reads a text file.
  • Saves the lines in a list.
  • Performs a regex match.
  • Converts the list to a string.
  • Removes unwanted special characters.
  • Puts the result back into a list.
  • Uses a Counter on the list and then packs the counts into a dictionary.
  • Finally plots the keys and values using Pandas.

As you can see, my Python experience is pretty limited. My code works perfectly for smaller files, but when I use something like a 700 MB file it seems to run forever!

How can I optimize my code?

Here is my input file format:

74M2S
73M
74M2S
*
73M
75M1S

Here is my code:

import matplotlib.pyplot as plt
import re
import pandas as pd
from collections import Counter

f = open('/PathTpFile/MyFILE.txt','r+')

listToStr:  str
str2:   str
mylist1 = []

for line in f.readlines():

    mylist1.append([re.findall(r'[\d]+M', line)])    
    mylist1.sort(reverse=True)
    listToStr = ' '.join(map(str, mylist1))

    specialChars = "M[]'"
    for specialChar in specialChars:
        listToStr = listToStr.replace(specialChar, '')

    words: list = listToStr.split()

counts = Counter(words)
dict(counts)
print(counts)

f.close()

keys = counts.keys()
values = counts.values()
print(counts.keys())
print(counts.values())
plt.bar(keys, values)
plt.savefig("out.png")

Solution

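    The slowdown comes from the loop body: for every line it re-sorts and re-joins the entire accumulated list, so the total work grows quadratically with the number of lines. To avoid that:

    1. Don't process the file line by line; read the entire file at once.
    2. Use re.findall() on the whole text to get all the matching numbers in one call.
    3. Use a regexp that returns only the numbers, so you don't need str.replace() to remove the extra characters. You can use a lookahead to match the M without including it in the results.
    4. There's no need to sort the words before counting.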
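
To make the lookahead in point 3 concrete, here is a small sketch run against the sample input from the question. The variable names `sample` and `numbers` are only illustrative; the pattern is the one suggested above.

```python
import re
from collections import Counter

# Sample input copied from the question; the '*' line contains no match.
sample = "74M2S\n73M\n74M2S\n*\n73M\n75M1S\n"

# \d+ matches a run of digits; (?=M) requires an M right after it
# without including the M in the returned match.
numbers = re.findall(r"\d+(?=M)", sample)
counts = Counter(numbers)

print(numbers)
print(counts)
```

Digits followed by S (such as the 2 in 74M2S) are skipped because the lookahead fails, so no cleanup with replace() is needed afterwards.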
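    Putting these together:

    import re
    from collections import Counter

    # Read the whole file at once instead of looping over lines.
    with open('/PathTpFile/MyFILE.txt','r') as f:
        text: str = f.read()

    # The lookahead (?=M) requires an M after the digits but leaves it out of the match.
    mylist1: list = re.findall(r'\d+(?=M)', text)
    counts: Counter = Counter(mylist1)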
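
If an ordering is still wanted for the plot, it can be applied after counting, when only the distinct values remain. The following is a minimal sketch, assuming descending numeric order is what the reverse sort in the question was after; the counts shown are made-up stand-ins for the Counter built from the real file.

```python
from collections import Counter

# Hypothetical counts, standing in for the Counter built from the file.
counts = Counter({"74": 2, "73": 2, "75": 1, "100": 4})

# The keys are strings, so sort by their integer value; a plain string
# sort would place "100" before "73".
ordered = sorted(counts.items(), key=lambda kv: int(kv[0]), reverse=True)
keys = [k for k, _ in ordered]
values = [v for _, v in ordered]
```

The resulting `keys` and `values` lists can then be passed to plt.bar(keys, values) as in the question.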