I've had a look at similar topics, but no solution I can find exactly compares to what I'm trying to achieve.
I have a cipher text that needs to undergo a simple letter substitution based on the frequency of each letter's occurrence in the text. I already have a function to normalise the text (lowercase, no non-letter characters), count letter occurrences and then get the relative frequency of each letter. The letter is the key in a dictionary, and the frequency is the value.
I also have the expected letter frequencies for A-Z in a separate dictionary (k=letter, v=frequency), but I'm a bit befuddled by what to do next.
What I think I need to do is to take the normalised cipher text, the expected letter freq dict [d1] and the cipher letter freq dict [d2] and iterate over them as follows (part pseudocode):
for word in text:
    for item in word:
        for k, v in d2.items():
            if d2[v] == d1[v]:
                replace any instance of d2[k] with d1[k] in text
decoded_text = open('decoded_text.txt', 'w')
decoded_text.write(str('the decoded text'))
Here, I want to take text and say "if the value in d2 matches a value in d1, replace any instance of d2[k] with d1[k] in text".
I realise I must have made a fair few basic Python logic errors there (I'm relatively new to Python), but am I on the right track?
Thanks in advance
Update:
Thank you for all the helpful suggestions. I decided to try Karl Knechtel's method, with a few alterations to fit in with my code. However, I'm still having problems (entirely in my implementation).
I have made a decode function that takes the ciphertext file in question. This calls the count function I made previously, which returns a dictionary (letter: frequency as a float). This meant that the "make uppercase version" code wouldn't work, as k and v were floats and have no .upper attribute. So, calling this decode function returns the ciphertext letter frequencies, and then the ciphertext itself, still encoded.
from operator import itemgetter

def sorted_histogram(a_dict):
    return [x[1] for x in sorted(a_dict.items(), key=itemgetter(1))]

def decode(filename):
    text = open(filename).read()
    cipher = text.lower()
    cipher_dict = count(filename)
    english_histogram = sorted_histogram(english_dict)
    cipher_histogram = sorted_histogram(cipher_dict)
    mapping = dict(zip(english_histogram, cipher_histogram))
    translated = ''.join(
        mapping.get(c, c)
        for c in cipher
    )
    return translated
You don't really want to do what you're thinking of doing, because the frequencies of characters in the sample won't, in general, match the exact frequency distribution in the reference data. What you're really trying to do is find the most common character and replace it with 'e', the next most and replace it with 't', and so on.
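To make that concrete, here is a minimal sketch comparing the two approaches. The two dictionaries hold made-up, illustrative values (three letters only, not real measurements), and the helper name by_rank is hypothetical:

```python
# Illustrative values only: three letters, invented frequencies.
reference = {'e': 0.127, 't': 0.091, 'a': 0.082}
sample = {'x': 0.131, 'q': 0.088, 'm': 0.079}

# Matching on exactly equal frequency values finds no pairs at all,
# because a sample's frequencies almost never equal the reference's.
exact = {
    c: p
    for c, f in sample.items()
    for p, g in reference.items()
    if f == g
}

def by_rank(freqs):
    """Return the letters of a {letter: frequency} dict, most frequent first."""
    return sorted(freqs, key=freqs.get, reverse=True)

# Matching on rank pairs the most common sample letter with the most
# common reference letter, the second with the second, and so on.
ranked = dict(zip(by_rank(sample), by_rank(reference)))

print(exact)   # {}
print(ranked)  # {'x': 'e', 'q': 't', 'm': 'a'}
```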
So what we're going to do is the following:
1. (I assume you can already do this part.) Construct a dictionary of actual letter frequencies in the ciphertext.
2. We define a function that takes a {letter: frequency} dictionary and produces a list of the letters in order of frequency.
3. We get the letters, in order of frequency, in our reference (i.e., now we have an ordered list of the most common letters in English), and in the sample (similarly).
4. On the assumption that the most common letter in the sample corresponds to the most common letter in English, and so on: we create a new dictionary that maps each letter in the sample list to the letter at the same position in the English list. (We could also create a translation table for use with str.translate.) We'll make uppercase and lowercase versions of the same dictionary (I'll assume your original dictionaries have only lowercase) and merge them together.
5. We use this mapping to translate the cipher text, leaving other characters (spaces, punctuation, etc.) alone.
Thus:
# 2.
import operator

def sorted_histogram(a_dict):
    return [
        x[0]  # the letter (the key)
        for x in sorted(a_dict.items(), key=operator.itemgetter(1), reverse=True)
        # of each dict item, sorted by value (i.e. the [1] element of each item),
        # most frequent first so that the top letters still line up if the
        # sample uses fewer letters than the reference.
    ]

# 3.
english_histogram = sorted_histogram(english_dict)
cipher_histogram = sorted_histogram(cipher_dict)

# 4.
# Make the lowercase version: cipher letter -> English letter of the same rank.
mapping = dict(zip(cipher_histogram, english_histogram))
# Make the uppercase version, and merge it in at the same time.
mapping.update(dict(
    (k.upper(), v.upper()) for (k, v) in zip(cipher_histogram, english_histogram)
))

# 5.
translated = ''.join(  # take these characters, and string them together:
    mapping.get(c, c)  # the mapped result, if possible; otherwise the original
    for c in cipher
)

# 6. Do whatever you want with 'translated' - write to file, etc.
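
For the str.translate alternative mentioned above, a rough sketch might look like the following. It assumes Python 3 (where str.maketrans exists) and lowercase-only {letter: frequency} dictionaries; the function names letters_by_frequency and build_table, and the tiny three-letter dictionaries, are hypothetical and only there to make the example runnable.

```python
def letters_by_frequency(freqs):
    """Return the letters of a {letter: frequency} dict, most frequent first."""
    return sorted(freqs, key=freqs.get, reverse=True)

def build_table(cipher_freqs, english_freqs):
    """Build a str.translate table mapping cipher letters to English letters by rank."""
    src = ''.join(letters_by_frequency(cipher_freqs))
    dst = ''.join(letters_by_frequency(english_freqs))
    # str.maketrans needs equal-length strings, so trim to the shorter one.
    n = min(len(src), len(dst))
    src, dst = src[:n], dst[:n]
    # Cover both lowercase and uppercase in a single table.
    return str.maketrans(src + src.upper(), dst + dst.upper())

# Toy data (invented): 'x' stands for 'e', 'q' for 't', 'm' for 'a'.
english_freqs = {'e': 0.5, 't': 0.3, 'a': 0.2}
cipher_freqs = {'x': 0.5, 'q': 0.3, 'm': 0.2}

table = build_table(cipher_freqs, english_freqs)
decoded = 'Mq qxm, xmq!'.translate(table)
print(decoded)  # At tea, eat!
```

Characters that are not in the table (spaces, punctuation, unmapped letters) pass through unchanged, which is the same behaviour as mapping.get(c, c) in the dictionary version.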