I am having a hard time producing a top-15 word count (a count for each word) for the document Wuthering Heights (https://www.gutenberg.org/files/768/768.txt) on Google Colab. It should only include the words that come after “ccx074@pglaf.org” and before “END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS”. This is the code that I tried:
file = open('768.txt', 'r')
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)
With the help of string.punctuation and operator.itemgetter, this could be an approach. It will get close. Note that removing the punctuation strips the sentence-ending marks (.!?) to get clean words, but it also removes apostrophes (which you probably don't want to remove).
from collections import Counter
from string import punctuation
from operator import itemgetter

d = Counter()
with open('wuthering_heights.txt', 'r') as f:
    opening = False
    for line in f:
        if line.startswith('ccx074@pglaf.org'):
            opening = True
        if not opening:
            continue
        if line.startswith('CHAPTER'):  # don't count chapter headings
            continue
        if line.startswith('***END OF THE PROJECT GUTENBERG EBOOK'):
            break
        line = line.strip()
        if len(line) == 0:
            continue
        # clean out punctuation
        line = line.translate(str.maketrans('', '', punctuation))
        d.update(line.lower().split())

print('different words count', len(d))
#print(d.most_common(15))
for word, count in reversed(sorted(d.items(), key=itemgetter(1))):
    print(word, count)
    if count < 290:
        break
This prints:
different words count 10098
and 4693
the 4552
i 3530
to 3476
a 2301
of 2221
he 1922
you 1712
her 1544
in 1459
his 1419
it 1284
she 1269
that 1188
was 1124
my 1098
me 1047
not 932
as 931
him 917
for 836
on 809
with 804
at 783
be 724
had 687
but 673
is 649
have 629
from 485
by 451
would 442
if 440
heathcliff 413
your 404
no 384
said 368
so 357
were 354
linton 340
catherine 333
an 317
we 311
mr 309
or 307
when 307
out 305
what 301
are 295
this 290
they 283
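The loop above stops at a hard-coded threshold (count < 290) rather than at exactly 15 words, and stripping all punctuation also drops apostrophes. A minimal sketch of one way to address both, assuming the two marker strings from the question appear verbatim in the text: slice the text between the markers, pull out words with a regular expression that keeps internal apostrophes, and let Counter.most_common(15) pick the top entries. The names START_MARKER, END_MARKER, WORD_RE, top_words and the sample string are made up for illustration and are not part of the original answer. Unlike the code above, this sketch matches the markers anywhere in the text and does not skip CHAPTER headings, so its counts on the real file may differ slightly from the listing.

```python
import re
from collections import Counter

# Marker strings taken from the question; assumed to appear verbatim.
START_MARKER = "ccx074@pglaf.org"
END_MARKER = "END OF THE PROJECT GUTENBERG EBOOK"

# ASCII letters with optional internal apostrophes, so "don't" stays one word.
# Accented letters are not matched; widen the character class if needed.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def top_words(text, n=15):
    """Return the n most common words found between the two markers."""
    start = text.find(START_MARKER)
    # Begin right after the start marker, or at the top if it is missing.
    start = 0 if start == -1 else start + len(START_MARKER)
    end = text.find(END_MARKER, start)
    if end == -1:
        end = len(text)  # no end marker: read to the end of the text
    body = text[start:end].lower()
    return Counter(WORD_RE.findall(body)).most_common(n)


# Tiny made-up sample standing in for the book.
sample = (
    "header text ccx074@pglaf.org\n"
    "Don't go. I don't know the way, and the moor is dark.\n"
    "***END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS***\n"
    "footer text footer"
)
for word, count in top_words(sample, 3):
    print(word, count)
```

To try it on the book itself, read the whole file into a string and pass it in, for example top_words(open('768.txt', encoding='utf-8', errors='replace').read()). The encoding argument is a guess; adjust it if the file is stored differently.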