I am new to Python and wrote some code that:
As you can see, my Python experience is pretty limited. My code works perfectly for smaller files, but when I use something like a 700 MB file it seems to run forever!
How can I optimize my code?
Here is my input file format.
74M2S
73M
74M2S
*
73M
75M1S
Here is my code:
import matplotlib.pyplot as plt
import re
import pandas as pd
from collections import Counter
f = open('/PathTpFile/MyFILE.txt','r+')
listToStr: str
str2: str
mylist1 = []
for line in f.readlines():
    mylist1.append([re.findall(r'[\d]+M', line)])
mylist1.sort(reverse=True)
listToStr = ' '.join(map(str, mylist1))
specialChars = "M[]'"
for specialChar in specialChars:
    listToStr = listToStr.replace(specialChar, '')
words: list = listToStr.split()
counts = Counter(words)
dict(counts)
print(counts)
f.close()
keys = counts.keys()
values = counts.values()
print(counts.keys())
print(counts.values())
plt.bar(keys, values)
plt.savefig("out.png")
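Use re.findall() on the entire file to get all the matching numbers. There is then no need for re.sub() to remove the extra characters. You can use a lookahead to match the M without including it in the results.
import re
from collections import Counter

with open('/PathTpFile/MyFILE.txt', 'r') as f:
    text: str = f.read()  # read the whole file in one go
# \d+(?=M) matches digits only when an M follows; the M is not part of the match
mylist1: list = re.findall(r'\d+(?=M)', text)
counts: dict = Counter(mylist1)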
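If holding the whole 700 MB file in memory turns out to be a problem, a streaming variant might look like the sketch below. It uses the same lookahead pattern but feeds the matches into the Counter one line at a time. The names count_match_lengths and MATCH_LEN are made up for this illustration, and the sample lines are the ones from the question's input format.

```python
import re
from collections import Counter

# Compiled once; the lookahead (?=M) requires an "M" right after the digits
# but leaves it out of the matched text.
MATCH_LEN = re.compile(r'\d+(?=M)')


def count_match_lengths(lines):
    """Count the numbers that precede an 'M' across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # findall returns a list of digit strings; update() tallies them
        counts.update(MATCH_LEN.findall(line))
    return counts


# Sample lines copied from the input format shown in the question.
sample = ["74M2S", "73M", "74M2S", "*", "73M", "75M1S"]
counts = count_match_lengths(sample)
print(counts)
```

Since an open file object yields its lines one by one, it should be possible to pass it straight in, for example count_match_lengths(f) inside a with open(...) block, so that only one line is held in memory at a time. The keys are still strings, so they would need converting with int() before sorting them numerically for the bar chart.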