Tags: python, cryptography, n-gram, frequency-analysis, vigenere

Using Python to analyse bigrams in a string of text


I am trying to use Python to help me crack Vigenère ciphers. I am fairly new to programming, but I've managed to write an algorithm to analyse bigram frequencies in a string of text. This is what I have so far:

import nltk, string
from nltk import bigrams

Ciphertext = str(input("What is the text to be analysed?"))

#Removes spacing and punctuation to make the text easier to analyse
def Remove_Formatting(str):
    str = str.upper()
    str = str.strip()
    str = str.replace(' ','')
    str = str.translate(str.maketrans({a:None for a in string.punctuation}))
    return str

Ciphertext = Remove_Formatting(Ciphertext)

#Score is meant to increase if most common bigrams are in the text
def Bigram(str):
    Common_Bigrams = ['TH',        'EN',        'NG',
                      'HE',        'AT',        'AL',
                      'IN',        'ED',        'IT',
                      'ER',        'ND',        'AS',
                      'AN',        'TO',        'IS',
                      'RE',        'OR',        'HA',
                      'ES',        'EA',        'ET',
                      'ON',        'TI',        'SE',
                      'ST',        'AR',        'OU',
                      'NT',        'TE',        'OF']
    Bigram_score = int(0)
    for bigram in str:
        if bigram in Common_Bigrams:
            Bigram_score += 1
            return Bigram_score

Bigram(Ciphertext)

print (Bigram_score)

However, when I try to run this with some text, I get this error:

Traceback (most recent call last):
  File "C:/Users/Tony/Desktop/Bigrams.py", line 36, in <module>
    print (Bigram_score)
NameError: name 'Bigram_score' is not defined

What does this mean? I thought I had already defined Bigram_score as a variable. I've tried everything I can think of, but I still get an error one way or another. What have I done wrong? Please help...

Thanks in advance,

Tony


Solution

  • You could make Bigram_score global, like this:

    def Bigram(string): # don't override str
        global Bigram_score
        Common_Bigrams = ['TH',        'EN',        'NG',
                          'HE',        'AT',        'AL',
                          'IN',        'ED',        'IT',
                          'ER',        'ND',        'AS',
                          'AN',        'TO',        'IS',
                          'RE',        'OR',        'HA',
                          'ES',        'EA',        'ET',
                          'ON',        'TI',        'SE',
                          'ST',        'AR',        'OU',
                          'NT',        'TE',        'OF']
        Bigram_score = 0 # a plain 0 is already an int, so int(0) is unnecessary
        for bigram in string:
            if bigram in Common_Bigrams:
                Bigram_score += 1
        return Bigram_score # return after the loop, not on the first hit
    

    You could also bind the returned result from the Bigram function to a variable, like this:

    Bigram_score = Bigram(Ciphertext)
    
    print(Bigram_score)
    

    or:

    print(Bigram(Ciphertext))
    

    Variables assigned inside a function are local to that function and do not exist outside it. If a function returns a value, that value must either be bound to a variable or used directly (as in the print call above); otherwise it is discarded.

    This is an example of how it works:

    spam = "spam" # global spam variable
    
    def change_spam():
        spam = "ham" # setting the local spam variable
        return spam
    
    change_spam()
    print(spam) # prints spam
    
    spam = change_spam() # here we assign the returned value to global spam
    print(spam) # prints ham
    

    In addition, your for loop iterates over single characters (unigrams), not bigrams. Let us take a closer look:

    for x in "hellothere":
        print(x)
    

    This prints one character per line, i.e. unigrams. Renaming the loop variable in your code from bigram to unigram makes the logical problem easier to see:

    for unigram in string:
        if unigram in Common_Bigrams:
            print("bigram hit!")
    

    A single character can never equal a two-character bigram, so "bigram hit!" will never be printed. One way to get bigrams instead is to use a while loop with an index and slice two characters at a time:

    index = 0
    n = 2 # for bigrams
    while index < len(string)-(n-1): # the last n-gram starts at index len(string)-n
        ngram = string[index:index+n] # slice out n characters starting at index
        index += 1 # advance one character; without this the loop never ends
        print(ngram)
    

    Then put whatever you want to do with each bigram inside the loop, for example the membership test against Common_Bigrams.
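
Putting the pieces together, a minimal sketch of the whole scoring step might look like the following. The helper names ngrams and bigram_score are hypothetical, the bigram list is copied from the question, and the input is assumed to be already upper-cased and stripped of spaces and punctuation (as Remove_Formatting does).

```python
# Bigram list copied from the question; a frozenset makes membership tests fast.
COMMON_BIGRAMS = frozenset([
    'TH', 'EN', 'NG', 'HE', 'AT', 'AL', 'IN', 'ED', 'IT', 'ER',
    'ND', 'AS', 'AN', 'TO', 'IS', 'RE', 'OR', 'HA', 'ES', 'EA',
    'ET', 'ON', 'TI', 'SE', 'ST', 'AR', 'OU', 'NT', 'TE', 'OF',
])


def ngrams(text, n=2):
    """Return every overlapping n-gram of text, in order."""
    # range() is empty when text is shorter than n, so the result is [].
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bigram_score(text):
    """Count how many overlapping bigrams of text are in COMMON_BIGRAMS."""
    return sum(1 for gram in ngrams(text, 2) if gram in COMMON_BIGRAMS)


print(ngrams("HELLO"))       # ['HE', 'EL', 'LL', 'LO']
print(bigram_score("THEN"))  # 3, from TH, HE and EN
```

The function returns the score instead of setting a global, so the caller decides what to do with it, for example print(bigram_score(Ciphertext)).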