python, computer-science, google-colaboratory

Showing the Word Count for Each Word


I am having a hard time producing a top-15 word count (the count for each word) for the document Wuthering Heights (https://www.gutenberg.org/files/768/768.txt) on Google Colab. It should only include the words that come after “ccx074@pglaf.org” and before “END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS”. This is the code that I tried.

file = open('768.txt', 'r')
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] +=1
for k,v in wordcount.items():
    print(k,v)

Solution

  • One approach uses string.punctuation to strip punctuation and operator.itemgetter to sort the words by count. It gets close to what you want. Note that stripping punctuation removes the sentence-ending marks (. ! ?) so that the words come out clean, but it also removes apostrophes, which you probably want to keep (for example, "don't" becomes "dont").

    from collections import Counter
    from string import punctuation
    from operator import itemgetter
    
    d = Counter()
    
    with open('wuthering_heights.txt', 'r') as f:
        opening = False
    
        for line in f:
            # skip every line before the one that starts with the start marker
            if line.startswith('ccx074@pglaf.org'):
                opening = True
            if not opening:
                continue
            if line.startswith('CHAPTER'): # don't count chapter headings
                continue
            if line.startswith('***END OF THE PROJECT GUTENBERG EBOOK'):
                break
            
            line = line.strip()
            if len(line) == 0:
                continue
            
            # clean out punctuation
            line = line.translate(str.maketrans('','',punctuation))
            
            d.update(line.lower().split())
    
            
    
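    print('different words count', len(d))
    # print(d.most_common(15))  # alternative: only the 15 most common words
    
    # print words from most to least frequent; stop after the first count below 290
    for word, count in reversed(sorted(d.items(), key=itemgetter(1))):
        print(word, count)
        if count < 290:
            break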
    

    This prints:

    different words count 10098
    and 4693
    the 4552
    i 3530
    to 3476
    a 2301
    of 2221
    he 1922
    you 1712
    her 1544
    in 1459
    his 1419
    it 1284
    she 1269
    that 1188
    was 1124
    my 1098
    me 1047
    not 932
    as 931
    him 917
    for 836
    on 809
    with 804
    at 783
    be 724
    had 687
    but 673
    is 649
    have 629
    from 485
    by 451
    would 442
    if 440
    heathcliff 413
    your 404
    no 384
    said 368
    so 357
    were 354
    linton 340
    catherine 333
    an 317
    we 311
    mr 309
    or 307
    when 307
    out 305
    what 301
    are 295
    this 290
    they 283
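
The answer above notes that stripping all punctuation also removes apostrophes, and its commented-out most_common(15) call hints at a shorter way to get exactly the top 15. The following is only a sketch of that variant, not the answerer's code: the function name top_words, the regular expression, and the two marker constants are assumptions made for illustration (the marker strings are copied from the question and may not match every copy of the file). Unlike the answer, it works on the whole text at once and does not skip CHAPTER headings.

```python
import re
from collections import Counter

# Marker strings taken from the question text (assumed; the exact wording
# in a given copy of the Gutenberg file may differ).
START_MARKER = "ccx074@pglaf.org"
END_MARKER = "END OF THE PROJECT GUTENBERG EBOOK"

# A word is a run of letters that may contain internal apostrophes ("don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def top_words(text, n=15, start=START_MARKER, end=END_MARKER):
    """Return the n most common words between the two markers (hypothetical helper)."""
    # Keep only the text after the start marker; if it is missing, keep everything.
    _, found, after = text.partition(start)
    body = after if found else text
    # Drop the end marker and everything after it (no-op if it is missing).
    body = body.partition(end)[0]
    # Lowercase and normalise curly apostrophes so "Don’t" and "don't" match.
    body = body.lower().replace("\u2019", "'")
    return Counter(WORD_RE.findall(body)).most_common(n)


# Small in-memory sample, so the sketch runs without downloading the book.
sample = (
    "header text ccx074@pglaf.org\n"
    "Don't go. Don't! Go home. Don't.\n"
    "***END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS***\n"
    "footer text"
)
print(top_words(sample, n=2))
```

For the real book, the same function could be given the contents of the downloaded file, for example the result of reading wuthering_heights.txt in full. The counts will differ slightly from the listing above, because apostrophes are kept and chapter headings are counted.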