Open XML document protection implementation (documentProtection class)

I'm trying to implement the Open XML documentProtection hash protection of a MS Word (2019) document in Python to test the hashing algorithm. So I've created a Word document, protected it against editing with this password: johnjohn. Then, opening the document as ZIP/XML, I see the following in the documentProtection section:

<w:documentProtection w:edit="readOnly" w:enforcement="1" w:cryptProviderType="rsaAES" w:cryptAlgorithmClass="hash" w:cryptAlgorithmType="typeAny" w:cryptAlgorithmSid="14" w:cryptSpinCount="100000" w:hash="pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw==" w:salt="pH1TDVHSfGBxkd3Q88UNhQ==" />

According to the Open XML docs (ECMA-376-1:2016 #17.15.1.29):

cryptAlgorithmSid="14" points to the SHA-512 algorithm
cryptSpinCount="100000" means that hashing must be done in 100k rounds, using the following algoright (quote from above standard):

Specifies the number of times the hashing function shall be iteratively run (runs using each iteration's result plus a 4 byte value (0-based, little endian) containing the number of the iteration as the input for the next iteration) when attempting to compare a user-supplied password with the value stored in the hashValue attribute.

The BASE64-encoded salt used for hashing ("pH1TDVHSfGBxkd3Q88UNhQ==") is prepended to the original password. The target BASE64-encoded hash must be "pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=="

So my Python script attempts to generate the same hash value with the described algorithm as follows:

import hashlib
import base64
import struct

TARGET_HASH = 'pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=='

TARGET_SALT = 'pH1TDVHSfGBxkd3Q88UNhQ=='
bsalt = base64.b64decode(TARGET_SALT)

def hashit(what, alg='sha512', **kwargs):
    if alg == 'sha1':
        return hashlib.sha1(what)
    elif alg == 'sha512':
        return hashlib.sha512(what)
    # etc...
    else:
        raise Exception(f'Unsupported hash algorithm: {alg}')

def gethash(data, salt=None, alg='sha512', iters=100000, base64result=True, returnstring=True):
    # encode password in UTF-16LE
    # ECMA-376-1:2016 17.15.1.29 (p. 1026)
    if isinstance(data, str): data = data.encode('utf-16-le')
    
    # prepend salt if provided
    if not salt is None:
        if isinstance(salt, str): salt = salt.encode('utf-16-le')
        ghash = salt + data
    else:
        ghash = data
    
    # hash iteratively for 'iters' rounds
    for i in range(iters):
        try:
            # next hash = hash(previous data) + 4-byte integer (previous round number) with LE byte ordering
            # ECMA-376-1:2016 17.15.1.29 (p. 1020)
            ghash = hashit(ghash, alg).digest() + struct.pack('<I', i)
        except Exception as err:
            print(err)
            break
    
    # remove trailing round number bytes
    ghash = ghash[:-4]

    # BASE64 encode if requested
    if base64result:
        ghash = base64.b64encode(ghash)
    # return as an ASCII string if requested
    if returnstring:
        ghash = ghash.decode()
        
    return ghash

But then when I run

print(gethash('johnjohn', bsalt))

I get the following hash which is not equal to the target one:

G47RT4/+JdE6pnrP6MqUKa3JyL8abeYSCX+E4+9J+6shiZqImBJ8M6bb+IMKEdvKd6+9dVnQ3oeOsgQz/aCdcQ==

Could I be wrong in my implementation somewhere or do you think there's a difference in the low-level hash function implementation (Python's hashlib vs. Open XML)?

Updated

I realized that Word uses a legacy algorithm to pre-process passwords (for compatibility with older versions). This algorithm is described at length in ECMA-376-1:2016 Part 4 (Transitional Migration Features, #14.8.1 "Legacy Password Hash Algorithm"). So I've managed to make a script that reproduces the official ECMA example:

def strtobytes(s, trunc=15):    
    b = s.encode('utf-16-le')
    # remove BOM symbol if present
    if b[0] == 0xfeff: b = b[1:]    
    pwdlen = min(trunc, len(s))
    if pwdlen < 1: return None
    return bytes([b[i] or b[i+1] for i in range(0, pwdlen * 2, 2)])

def process_pwd(pwd):
    # 1. PREPARE PWD STRING (TRUNCATE, CONVERT TO BYTES)
    pw = strtobytes(pwd) if isinstance(pwd, str) else pwd[:15]
    pwdlen = len(pw)
    
    # 2. HIGH WORD CALC
    HW = InitialCodeArray[pwdlen - 1]
    for i in range(pwdlen):
        r = 15 - pwdlen + i
        for ibit in range(7):
            if (pw[i] & (0x0001 << ibit)):                
                HW ^= EncryptionMatrix[r][ibit]
    
    # 3. LO WORD CALC
    LW = 0
    for i in reversed(range(pwdlen)):
        LW = (((LW >> 14) & 0x0001) | ((LW << 1) & 0x7FFF)) ^ pw[i]
    LW = (((LW >> 14) & 0x0001) | ((LW << 1) & 0x7FFF)) ^ pwdlen ^ 0xCE4B    
    
    # 4. COMBINE AND REVERSE
    return bytes([LW & 0xff, LW >> 8, HW & 0xff, HW >> 8])

So when I do process_pwd('Example') I get what's said in the ECMA (0x7EEDCE64). The hashing function was also modified (the initial SALT + HASH should not be included in the main iteration loop, as I found on a forum):

def gethash(data, salt=None, alg='sha512', iters=100000, base64result=True, returnstring=True):
    
    def hashit(what, alg='sha512'):
        return getattr(hashlib, alg)(what)
    
    # encode password with legacy algorithm if a string is given
    if isinstance(data, str): 
        data = process_pwd(data)
        
    if data is None: 
        print('WRONG PASSWORD STRING!')
        return None
    
    # prepend salt if provided
    if not salt is None:
        if isinstance(salt, str): 
            salt = process_pwd(salt)
            if salt is None:
                print('WRONG SALT STRING!')
                return None
        ghash = salt + data
    else:
        ghash = data
    
    # initial hash (salted)
    ghash = hashit(ghash, alg).digest()
    
    # hash iteratively for 'iters' rounds
    for i in range(iters):
        try:
            # next hash = hash(previous data + 4-byte integer (previous round number) with LE byte ordering)
            # ECMA-376-1:2016 17.15.1.29 (p. 1020)
            ghash = hashit(ghash + struct.pack('<I', i), alg).digest()
        except Exception as err:
            print(err)
            return None

    # BASE64 encode if requested
    if base64result:
        ghash = base64.b64encode(ghash)
        
    # return as an ASCII string if requested
    if returnstring:
        ghash = ghash.decode()
        
    return ghash

However many times I've re-checked this code, I couldn't see any more errors. But I still can't reproduce the target hash in the test Word document:

myhash = gethash('johnjohn', base64.b64decode('pH1TDVHSfGBxkd3Q88UNhQ=='))
print(myhash)
print(TARGET_HASH == myhash)

I get:

wut2VOpT+X8pKXky6u/+YtwRX2inDv1WVC8FtZcdxKsyX0gHNBJGYwBgV8xzq7Rke/hWMfWe9JVvqDQAZ11A5w==

False

UPDATE (August 2022)

Returning to this question, I've updated my Python code adapting the detailed answer below (thanks @Andrew O!). My full code is now as follows:

# coding: utf-8
import hashlib
import base64

TARGET_HASH = 'pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=='
TARGET_SALT = 'pH1TDVHSfGBxkd3Q88UNhQ=='

HighOrderWords = [
    [0xE1, 0xF0],
    [0x1D, 0x0F],
    [0xCC, 0x9C],
    [0x84, 0xC0],
    [0x11, 0x0C],
    [0x0E, 0x10],
    [0xF1, 0xCE],
    [0x31, 0x3E],
    [0x18, 0x72],
    [0xE1, 0x39],
    [0xD4, 0x0F],
    [0x84, 0xF9],
    [0x28, 0x0C],
    [0xA9, 0x6A],
    [0x4E, 0xC3]
]

EncryptionMatrix = [
    [[0xAE, 0xFC], [0x4D, 0xD9], [0x9B, 0xB2], [0x27, 0x45], [0x4E, 0x8A], [0x9D, 0x14], [0x2A, 0x09]],
    [[0x7B, 0x61], [0xF6, 0xC2], [0xFD, 0xA5], [0xEB, 0x6B], [0xC6, 0xF7], [0x9D, 0xCF], [0x2B, 0xBF]],
    [[0x45, 0x63], [0x8A, 0xC6], [0x05, 0xAD], [0x0B, 0x5A], [0x16, 0xB4], [0x2D, 0x68], [0x5A, 0xD0]],
    [[0x03, 0x75], [0x06, 0xEA], [0x0D, 0xD4], [0x1B, 0xA8], [0x37, 0x50], [0x6E, 0xA0], [0xDD, 0x40]],
    [[0xD8, 0x49], [0xA0, 0xB3], [0x51, 0x47], [0xA2, 0x8E], [0x55, 0x3D], [0xAA, 0x7A], [0x44, 0xD5]],
    [[0x6F, 0x45], [0xDE, 0x8A], [0xAD, 0x35], [0x4A, 0x4B], [0x94, 0x96], [0x39, 0x0D], [0x72, 0x1A]],
    [[0xEB, 0x23], [0xC6, 0x67], [0x9C, 0xEF], [0x29, 0xFF], [0x53, 0xFE], [0xA7, 0xFC], [0x5F, 0xD9]],
    [[0x47, 0xD3], [0x8F, 0xA6], [0x0F, 0x6D], [0x1E, 0xDA], [0x3D, 0xB4], [0x7B, 0x68], [0xF6, 0xD0]],
    [[0xB8, 0x61], [0x60, 0xE3], [0xC1, 0xC6], [0x93, 0xAD], [0x37, 0x7B], [0x6E, 0xF6], [0xDD, 0xEC]],
    [[0x45, 0xA0], [0x8B, 0x40], [0x06, 0xA1], [0x0D, 0x42], [0x1A, 0x84], [0x35, 0x08], [0x6A, 0x10]],
    [[0xAA, 0x51], [0x44, 0x83], [0x89, 0x06], [0x02, 0x2D], [0x04, 0x5A], [0x08, 0xB4], [0x11, 0x68]],
    [[0x76, 0xB4], [0xED, 0x68], [0xCA, 0xF1], [0x85, 0xC3], [0x1B, 0xA7], [0x37, 0x4E], [0x6E, 0x9C]],
    [[0x37, 0x30], [0x6E, 0x60], [0xDC, 0xC0], [0xA9, 0xA1], [0x43, 0x63], [0x86, 0xC6], [0x1D, 0xAD]],
    [[0x33, 0x31], [0x66, 0x62], [0xCC, 0xC4], [0x89, 0xA9], [0x03, 0x73], [0x06, 0xE6], [0x0D, 0xCC]],
    [[0x10, 0x21], [0x20, 0x42], [0x40, 0x84], [0x81, 0x08], [0x12, 0x31], [0x24, 0x62], [0x48, 0xC4]]
]

def hashit(what, alg='sha1', **kwargs):
    f = getattr(hashlib, alg, None)
    if f is None:
        raise Exception(f'Unsupported hash algorithm: {alg}')
    return f(what)

def strtobytes(s, trunc=15):    
    b = s.encode('utf-16-le')
    # remove BOM symbol if present
    if b[0] == 0xfeff: b = b[1:]    
    pwdlen = min(trunc, len(s))
    if pwdlen < 1: return None
    return bytearray([b[i] or b[i+1] for i in range(0, pwdlen * 2, 2)])

def generate_hash(password: str, salt: bytes = None, alg: str = 'sha512', iters: int = 100000, base64result=True, returnstring=True):
    """
    Algorithm given in ECMA-374, 1st Edition, December 2006
    https://www.ecma-international.org/wp-content/uploads/ecma-376_first_edition_december_2006.zip
    Alternatively: https://c-rex.net/projects/samples/ooxml/e1/Part4/OOXML_P4_DOCX_documentProtection_topic_ID0EJVTX.html
    """
    # Truncate the password to 15 characters
    passwordBytes = strtobytes(password)
    # Obtain the high-order word from the magic list based on the length of the password. 
    # If the password is 0 length, it's just two zero bytes
    passwordLength = len(passwordBytes)
    highOrderWord = bytearray([0, 0])
    # For each byte in the password, grab the bits based on its position in the encryption matrix 
    # (taking care that the last character always corresponds to the last row, 
    # the first part of the matrix may be unused if the password is shorter than 15 bytes). 
    # For the first to seventh bit, if it's set, do a XOR operation with the current value of the high order word. 
    # Repeat for each character. 
    if passwordLength > 0:
        highOrderWord = bytearray(HighOrderWords[passwordLength - 1])
        for i in range(passwordLength):
            passwordByte = passwordBytes[i]
            m = i + 15 - passwordLength
            for j in range(7):
                if (passwordByte & j) == 0: 
                    continue
                for k in range(2):
                    highOrderWord[k] ^= EncryptionMatrix[m][j][k]
    # Grab a low order word (2 bytes) and initialize to zero
    lowOrderWord = 0  
    # Perform the operation on each character, starting from the last character in the password and working forwards: 
    # low-order word = ( ((low-order word >> 14) AND 0x0001) | (low-order word << 1) & 0x7FFF)) ^ character (byte)
    for i in reversed(range(passwordLength)):
        passwordByte = passwordBytes[i]
        lowOrderWord = ( ((lowOrderWord >> 14) & 1) | ((lowOrderWord << 1) & 0x7FFF) ) ^ passwordByte
    # Then do low-order word = (((low-order word >> 14) & 0x0001) | (low-order word << 1) & 0x7FFF)) ^ password length ^ 0xCE4B
    lowOrderWord = ( ((lowOrderWord >> 14) & 1) | ((lowOrderWord << 1) & 0x7FFF) ) ^ passwordLength ^ 0xCE4B
    lowOrderWord = lowOrderWord.to_bytes(2, 'big')

    # Form the key by appending the low order word to the high order word, then reverse the byte ordering
    key = (highOrderWord + lowOrderWord)[::-1]
    # For some reason, Microsoft Word then uses the Unicode hex representation of the above key, 
    # then back converts that representation into bytes
    # In Word, an additional third stage is added to the process of hashing and storing a user supplied password. 
    # In this third stage, the reversed byte order legacy hash from the second stage shall be converted to Unicode hex string representation 
    # [Example: If the single byte string 7EEDCE64 is converted to Unicode hex string it will be represented in memory as the following byte stream: 
    # 37 00 45 00 45 00 44 00 43 00 45 00 36 00 34 00], and that value shall be hashed as defined by the attribute values
    # https://learn.microsoft.com/en-us/openspecs/office_standards/ms-oe376/fb220a2f-88d4-488c-a9b7-e094756b6699
    key = ''.join('{:02x}'.format(x) for x in key).replace('-', '').encode('utf-8')
    computedHash = bytearray(key)
    # Now compute the hash once by prepending the salt bytes to the result from above. 
    # If there are no salt bytes, just skip this step
    if salt:
        computedHash = bytearray(salt) + key
    # Word requires that the initial hash of the password with the salt not be considered in the count
    computedHash = bytearray(hashit(computedHash, alg).digest())
    # If there are iterations to compute, for each iteration, convert the iteration count (0-base) to a 32-bit (4 byte) integer (little endian), 
    # and (documentation wasn't clear on this, it just said to "add" the bytes - but to align with the output I had to append it) append this to the current computed hash. 
    # Apply the requested hash algorithm (Word seems to default to SHA512, but from testing I saw that it handles the other options fine as well)
    for i in range(iters):
        # ISO/IEC 29500-1 Fourth Edition, 2016-11-01
        # 17.15.1.29 - spinCount
        # Specifies the number of times the hashing function shall be iteratively run 
        # (runs using each iteration's result plus a 4 byte value (0-based, little endian) containing the number of the iteration 
        # as the input for the next iteration) when attempting to compare a user-supplied password with the value stored in the hashValue attribute
        computedHash += i.to_bytes(4, 'little')
        computedHash = bytearray(hashit(computedHash, alg).digest())

    # Return the above as a base-64 encoded string. This is what goes in the documentProtection attribute.

    # BASE64 encode if requested
    if base64result:
        computedHash = base64.b64encode(computedHash)
        
    # return as an ASCII string if requested
    if returnstring:
        computedHash = computedHash.decode('utf-8')
        
    return computedHash

# -------------------------------------------------------------------- #

if __name__ == '__main__':
    myhash = generate_hash('johnjohn', base64.b64decode(TARGET_SALT))
    print(myhash)
    print(TARGET_HASH == myhash)

But ALAS! -- still assertion fails. Which means I'm getting something wrong here... Who can help adapt the C# to Python 1:1?

Solution

I tried to implement the method VcSaJen explained in his answer and it worked for me (my use case is creating a hash that works in ms word), so just posting the code here if it helps someone

import hashlib
import os
import base64


def docx_sha512_hash(password, spin_count=0):
    salt = os.urandom(16)
    password_bytes = password.encode("utf-16le")
    combined_bytes = salt + password_bytes
    hashed_password = hashlib.sha512(combined_bytes).digest()
    for i in range(spin_count):
        index_bytes = i.to_bytes(4, byteorder="little")
        hashed_password += index_bytes
        hashed_password = hashlib.sha512(hashed_password).digest()
    b64_hash = base64.b64encode(hashed_password).decode("utf-8")
    b64_salt = base64.b64encode(salt).decode("utf-8")
    return b64_hash, b64_salt

then add w:documentProtection w:edit="readOnly" w:enforcement="1" w:algorithmName="SHA-512" w:spinCount="100000" w:hashValue="<hashValue>" w:saltValue="<saltValue>" to the xml and create the docx