Determine ROT encoding

I want to determine which type of ROT encoding is used and, based on that, decode the text correctly.

Also, I have found the following code, which correctly decodes the rot13 string "sbbone" to "foobar":

import codecs
codecs.decode('sbbone', 'rot_13')

The thing is, I'd like to run this Python file against an existing file that is rot13-encoded (for example, rot13.py encoded.txt).

Thank you!

Solution

To answer the second part of your first question (decoding something in ROT-x), you can use the following code:

def encode(s, ROT_number=13):
    """Encodes a string (s) using ROT (ROT_number) encoding."""
    ROT_number %= 26  # To avoid IndexErrors
    alpha = "abcdefghijklmnopqrstuvwxyz" * 2
    alpha += alpha.upper()
    def get_i():
        for i in range(26):
            yield i  # indexes of the lowercase letters
        for i in range(52, 78):
            yield i  # indexes of the uppercase letters
    ROT = {alpha[i]: alpha[i + ROT_number] for i in get_i()}
    return "".join(ROT.get(i, i) for i in s)


def decode(s, ROT_number=13):
    """Decodes a string (s) using ROT (ROT_number) encoding."""
    return encode(s, 26 - ROT_number % 26)
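
For the file part of the question (rot13.py encoded.txt), here is a minimal sketch. It assumes the file is UTF-8 text and builds the rotation with str.maketrans instead of the dictionary above; the names rot, decode_file and main are made up for this example.

```python
import string
import sys


def rot(text, shift=13):
    """Rotate ASCII letters by `shift` places; other characters pass through."""
    shift %= 26
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)


def decode_file(path, shift=13):
    """Read `path` (assumed UTF-8) and undo a ROT-`shift` encoding."""
    with open(path, encoding="utf-8") as f:
        return rot(f.read(), -shift)


def main(argv):
    """Entry point for: python rot13.py encoded.txt [shift]"""
    if len(argv) < 2:
        print("usage: rot13.py FILE [SHIFT]", file=sys.stderr)
        return 1
    shift = int(argv[2]) if len(argv) > 2 else 13
    sys.stdout.write(decode_file(argv[1], shift))
    return 0

# When saved as a script, end the file with:
#     if __name__ == "__main__":
#         sys.exit(main(sys.argv))
```

With that ending in place, python rot13.py encoded.txt should print the decoded text, and an optional second argument selects a shift other than 13.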

To answer the first part of your first question (finding the ROT encoding of an arbitrarily encoded string), you probably want to brute-force it: try all 26 ROT encodings and check which one makes the most sense. A quick(-ish) way to do this is to get a newline-delimited file (e.g. cat\ndog\nmouse\nsheep\nsay\nsaid\nquick\n... where \n is a newline) containing the most common words in the English language, and then check which rotation produces the most words found in it.

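with open("words.txt") as f:
    # frozenset for fast membership tests; split() also drops blank lines
    words = frozenset(f.read().lower().split())


def get_most_likely_encoding(s, delimiter=" "):
    """Returns the ROT number most likely used to encode the string (s)."""
    alpha = "abcdefghijklmnopqrstuvwxyz" + delimiter
    for punctuation in "\n\t,:; .()":
        # str.replace returns a new string, so the result must be reassigned
        s = s.replace(punctuation, delimiter)
    s = "".join(c for c in s if c.lower() in alpha)
    # Undo each candidate rotation and count the known words in the result
    word_count = [sum(w.lower() in words for w in encode(
            s, -enc).split(delimiter)) for enc in range(26)]
    return word_count.index(max(word_count))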
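
The brute-force idea can be tried without a word file. The following is a rough, self-contained sketch: SAMPLE_WORDS is a tiny made-up stand-in for a real word list, and rot and guess_rot are names invented for this example, not part of the code above.

```python
import string


def rot(text, shift):
    """Rotate ASCII letters by `shift` places; other characters pass through."""
    shift %= 26
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)


def guess_rot(text, words):
    """Return the ROT number (0-25) most likely used to encode `text`."""
    def score(shift):
        candidate = rot(text, -shift).lower()
        # Turn every non-letter into a space so punctuation does not stick to words
        cleaned = "".join(c if c.isalpha() else " " for c in candidate)
        return sum(w in words for w in cleaned.split())
    # max() returns the first shift with the highest score
    return max(range(26), key=score)


# Hypothetical stand-in for a real dictionary file
SAMPLE_WORDS = frozenset(
    {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
)

encoded = rot("The quick brown fox jumps over the lazy dog.", 7)
shift = guess_rot(encoded, SAMPLE_WORDS)
print(shift, rot(encoded, -shift))  # 7 The quick brown fox jumps over the lazy dog.
```

With a word list this small, short or unusual inputs can easily tie or be misidentified; a larger dictionary makes the guess more reliable.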

A file on Unix machines that you could use is /usr/dict/words (on most modern systems it lives at /usr/share/dict/words); copies can also be found online.