Search code examples
python-2.7encodingxorhamming-distance

Breaking XOR repeated key


I want to break XOR repeated key, I dont now anything about the key nor the message, only thing I know that it is using repeated key. Encoded message s beenbase64'd after being encrypted with repeating-key XOR so I converted base 64 to base16 first so it is easier. I have instructions but I don't understand this very good.

  1. Let KEYSIZE be the guessed length of the key; try values from 2 to (say) 40. Write a function to compute the edit distance/Hamming distance between two strings.

  2. For each KEYSIZE, take the first KEYSIZE worth of bytes, and the second KEYSIZE worth of bytes, and find the edit distance between them. Normalize this result by dividing by KEYSIZE.

  3. The KEYSIZE with the smallest normalized edit distance is probably the key. You could proceed perhaps with the smallest 2-3 KEYSIZE values. Or take 4 KEYSIZE blocks instead of 2 and average the distances.

Now that you probably know the KEYSIZE: break the ciphertext into blocks of KEYSIZE length, etc, I got this and the rest fine, for now, I should now exactly when I found out if this is good and try to decode..

I wrote a code for this in Python, it is working, but I am not completely sure if I have done this correctly

    def compute_distance(str1,str2,keysize):
      count=0
      str1=str1.replace("\n", "")
      str2=str2.replace("\n", "")
      keysize=str(keysize*8)
      sbin1=format(int(str1,16),'0'+keysize+'b')
      sbin2=format(int(str2,16),'0'+keysize+'b')

      for c1,c2 in zip(sbin1, sbin2):
        if  c1!=c2:
          count+=1

      return count


    def keysize_dist(filelocation):
      f=open(filelocation,'r')
      lines=[]
      for line in f.readlines():
        line=line.strip('\n')
        lines.append(line)
      lines=''.join(lines).strip('\n')
      normalized=[]
      for keysize in range(2,40):
        count=compute_distance(lines[0:keysize*2],lines[keysize*2:keysize*4],keysize)

        normalized.append(float(count)/keysize)


      return lines,int(min(normalized))

Solution

  • This is way i understood from your post. I did a python program that generate the ciphered xor stream with cycling key and that try to apply hamming string distance normalized method to find the best potential cycling keysize. I don't convert things into base64 and i apply directly string distance not binary distance.

    #!/usr/bin/python
    
    import sys
    from itertools import cycle
    
    def xor_file_with_cycling_strkey(filelocation,outfile,key):
      print filelocation
      f=open(filelocation,'r')
      f2=open(outfile,'w')
      lines=[]
      text=f.read()
      if text != '':
        for c,k in zip(text,cycle(key)):
          r=chr(ord(c)^ord(k))
          f2.write(r)
      f2.close()
      f.close()
    
    # not used here, see compute_distance_char based on same idea.
    def compute_distance(str1,str2,keysize):
      count=0
      print '%s %s' % (str1,str2)
      str1=str1.replace("\n", "")
      str2=str2.replace("\n", "")
      keysize=str(keysize*8)
      sbin1=format(int(str1,16),'0'+keysize+'b')
      sbin2=format(int(str2,16),'0'+keysize+'b')
      return hamming_distance_str(sbin1,sbin2)
    
    #do preferer hamming_distance_bin which quicker.
    def compute_distance_char(str1,str2,keysize):
      count=0
      str1=str1.replace("\n", "")
      str2=str2.replace("\n", "")
      keysize=str(keysize*8)
      sbin1=''
      sbin2=''
      for c in str1:
        sbin1=sbin1 + format(ord(c),'0'+keysize+'b')
      for c in str2:
        sbin2=sbin2 + format(ord(c),'0'+keysize+'b')
      return hamming_distance_str(sbin1,sbin2)
    
    def hamming_distance_str(str1,str2):
      count=0
      for c1,c2 in zip(str1, str2):
        if  c1!=c2:
          count+=1
      return count
    
    def hamming_distance_bin(str1,str2):
      count=0
      for c1,c2 in zip(str1, str2):
        if  c1!=c2:
          # quick hamming distance, counting number of differing bits.
          s=ord(c1)^ord(c2)
          # count number of bits sets using Wegner algorithm
          while s !=0:
            s&=(s-1);
            count+=1
      return count
    
    def keysize_dist(filelocation):
      potential_keysize=0
      min_dist=40.0
      f=open(filelocation,'r')
      lines=[]
      for line in f.readlines():
        line=line.strip('\n')
        lines.append(line)
      lines=''.join(lines).strip('\n')
      normalized=[]
      for keysize in range(2,40):
    # should first create base16 entries for that one , then don't use it : count_bin1=compute_distance(lines[0:keysize*2],lines[keysize*2:keysize*4],keysize)
        # proof that both functions compute same value :
        count_bin1=compute_distance_char(lines[0:keysize*2],lines[keysize*2:keysize*4],keysize)
        count_bin2=hamming_distance_bin(lines[0:keysize*2],lines[keysize*2:keysize*4])
        if ( count_bin1 != count_bin2 ):
          print 'Discrepency between compute_distance_char->%i and hamming_distance_bin->%i' % (count_bin1,count_bin2)
        count=hamming_distance_str(lines[0:keysize*2],lines[keysize*2:keysize*4])
    
        normalized_distance=float(count)/keysize
        print '%s %f' % (keysize,normalized_distance)
        if ( normalized_distance < min_dist ):
          potential_keysize=keysize
          min_dist=normalized_distance
    #  we are more interested in keysize corresponding to minimal distance, tha n to minimal distance itself.
      return potential_keysize,min_dist
    
    def main(args=sys.argv):
     if ( len(args) < 2 ):
       print 'Please enter cleartext origin file to be ciphered then checked an optionaly a key string ( max length 40 )'
       return 1
     if ( len(args) > 2):
       key=args[2]
     else:
       # on purpose default to key with a KEYSIZE char length 5.
       key='12345'
     xor_file_with_cycling_strkey(args[1],args[1]+'.ciphered',key)
     xor_file_with_cycling_strkey(args[1]+'.ciphered',args[1] + '.cleartext',key)
    
     # raw non base64 encoded.
     print keysize_dist(args[1] + '.ciphered')
    
    if __name__ == "__main__":
        main()
    

    With that code your can get all inputs needed to fully resolve your problem.

    ./hamming_detect_xor_cycle.py cleartext 123456789ABCDE ... (14, 1.7857142857142858)

    It does not detect correctly all size, but i think this is a statistical effect and depends on cleartext that itself can have cycling properties. as your subject tells : using with more blocks can give a better result.