I'm trying to count the items in a list of strings with the list.count() method and sort the results from largest to smallest. count() performs reasonably well on small lists, but it does not scale at all, as the small experiment below shows: the input list is multiplied in length on each of five cycles, and by the fifth cycle the runtime has exploded (a sixth cycle took too long to wait for). Is there a way to optimize the first list comprehension, or an alternative to count() that scales better?
import nltk
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models
from operator import itemgetter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."
unigrams = nltk.word_tokenize(t.lower())
for size in range(1, 6):
    unigrams = unigrams * size  # the list grows by a factor of `size` each cycle
    start = time.time()
    # count() scans the whole list once per word
    unigram_freqs = [unigrams.count(word) for word in unigrams]
    freq_pairs = set(zip(unigrams, unigram_freqs))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]
    end = time.time()
    time_elapsed = round(end - start, 3)
    print("Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
# Runtime: 0.001s for 1x the size
# Runtime: 0.003s for 2x the size
# Runtime: 0.022s for 3x the size
# Runtime: 0.33s for 4x the size
# Runtime: 8.065s for 5x the size
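The root cause is that every count() call scans the whole list, so the comprehension does O(n) work n times, i.e. O(n^2) overall. One way to keep the comprehension but make it scale is to tally the counts in a single pass first, for example with a plain dict (a minimal sketch of the idea; the names are only illustrative):

counts = {}
for word in unigrams:
    counts[word] = counts.get(word, 0) + 1  # one O(n) pass over the list
unigram_freqs = [counts[word] for word in unigrams]  # O(1) lookups, not O(n) scans

With the counts precomputed, each lookup is constant time and the whole pipeline becomes linear.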
Using Counter from collections and sorting with its most_common() method, I get essentially zero seconds regardless of size. Counter tallies the whole list in a single pass, and most_common() returns the (word, count) pairs already sorted from most to least frequent:
import nltk
nltk.download('punkt')
from operator import itemgetter
from collections import Counter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."
unigrams = nltk.word_tokenize(t.lower())
for size in range(1, 5):
    unigrams = unigrams * size  # same multiplicative growth as before
    start = time.time()
    # slow path: quadratic counting with list.count()
    unigram_freqs = [unigrams.count(word) for word in unigrams]
    freq_pairs = set(zip(unigrams, unigram_freqs))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]
    end = time.time()
    time_elapsed = round(end - start, 3)
    print("Slow Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
    start = time.time()
    # fast path: Counter tallies in one pass; most_common() sorts descending by count
    a = Counter(unigrams).most_common()
    #print(a)
    end = time.time()
    time_elapsed = round(end - start, 3)
    print("Fast Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
Slow Runtime: 0.003s for 1x the size
Fast Runtime: 0.0s for 1x the size
Slow Runtime: 0.006s for 2x the size
Fast Runtime: 0.0s for 2x the size
Slow Runtime: 0.157s for 3x the size
Fast Runtime: 0.0s for 3x the size
Slow Runtime: 1.891s for 4x the size
Fast Runtime: 0.001s for 4x the size
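Called with no argument, most_common() returns every (word, count) pair already sorted from largest to smallest count, so it replaces the whole zip/set/sorted pipeline in one call. A minimal usage sketch:

from collections import Counter
freq_pairs = Counter(unigrams).most_common()  # [(word, count), ...], descending by count

One small behavioral note: most_common() orders tied counts by the order in which the words were first encountered, while sorted(...)[::-1] over a set leaves ties in arbitrary order, so tied words may come out in a different order.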