character-encoding md5 checksum

Does character encoding affect hash algorithms like MD5?


I have an issue where a client calculates the checksum for a file, then sends it along with the file via a socket to a server. The server recalculates the file's checksum and compares it with the checksum it received. The debug output is as follows:

Server calculated checksum: 53613E7F8AB289BDDC1EBF1E0929F1FD
<?xml version="1.0" encoding="utf-16"?>
<TestHistory>
  <TestHistory>
    <SerialNumber>13231312</SerialNumber>
    <PrinterFamily>Tabletop</PrinterFamily>
    <StartTime>1371515505</StartTime>
    <EndTime>1371515510</EndTime>
    <TestStation>PreCal</TestStation>
    <TestResult>P</TestResult>
    <FailureResult>
    </FailureResult>
    <NetworkDomainName>zgn</NetworkDomainName>
    <TestPCName>01-93RZheng</TestPCName>
    <TestSoftwareVersion>0.0.7</TestSoftwareVersion>
    <TestSite>Jabil</TestSite>
    <MAC_BT>00-G0-D0-86-CB-F7</MAC_BT>
    <MAC_WiFi>00-S0-D0-26-BD-X7</MAC_WiFi>
    <MAC_Wired>00-B0-D0-81-BB-L7</MAC_Wired>
    <SKU>1rf3</SKU>
    <WorkOrder>1313231</WorkOrder>
    <Firmware_MLB>3.1321.bd2</Firmware_MLB>
    <DateEntered />
    <PrintheadMFGInfo>ee1</PrintheadMFGInfo>
    <PrintheadSN>13-21</PrintheadSN>
  </TestHistory>
</TestHistory>


Client calculated checksum: 0AFE9F429DE4C6E675297FA861C0CCA9
<?xml version="1.0" encoding="utf-16"?>
<TestHistory>
  <TestHistory>
    <SerialNumber>13231312</SerialNumber>
    <PrinterFamily>Tabletop</PrinterFamily>
    <StartTime>1371515505</StartTime>
    <EndTime>1371515510</EndTime>
    <TestStation>PreCal</TestStation>
    <TestResult>P</TestResult>
    <FailureResult>
    </FailureResult>
    <NetworkDomainName>zgn</NetworkDomainName>
    <TestPCName>01-93RZheng</TestPCName>
    <TestSoftwareVersion>0.0.7</TestSoftwareVersion>
    <TestSite>Jabil</TestSite>
    <MAC_BT>00-G0-D0-86-CB-F7</MAC_BT>
    <MAC_WiFi>00-S0-D0-26-BD-X7</MAC_WiFi>
    <MAC_Wired>00-B0-D0-81-BB-L7</MAC_Wired>
    <SKU>1rf3</SKU>
    <WorkOrder>1313231</WorkOrder>
    <Firmware_MLB>3.1321.bd2</Firmware_MLB>
    <DateEntered />
    <PrintheadMFGInfo>ee1</PrintheadMFGInfo>
    <PrintheadSN>13-21</PrintheadSN>
  </TestHistory>
</TestHistory>

Running a standard diff tool on the file portion of the two outputs shows no differences between the files, which leads me to believe that encoding, or something else, is causing the checksum mismatch.


Solution

  • Yes; it certainly does.

    Hash algorithms operate on raw bytes.

    If you want to hash a string, you need to somehow convert the characters into bytes.

    Different encodings will result in different bytes.

    Also, every single character contributes bytes that affect the hash, including the characters of the <?xml ... ?> declaration.
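
To illustrate the point about encodings, here is a minimal sketch in Python (the question does not say which language the client and server use, so the language and the helper names `md5_hex_bytes` and `md5_hex_text` are assumptions made for this example). It hashes the same text under three encodings and shows that the digests differ, because the underlying bytes differ.

```python
import hashlib


def md5_hex_bytes(data: bytes) -> str:
    """Return the uppercase MD5 hex digest of raw bytes."""
    return hashlib.md5(data).hexdigest().upper()


def md5_hex_text(text: str, encoding: str) -> str:
    """Encode text with the given encoding, then hash the resulting bytes."""
    return md5_hex_bytes(text.encode(encoding))


# A shortened stand-in for the XML in the question (hypothetical sample).
xml_text = '<?xml version="1.0" encoding="utf-16"?>\n<TestHistory />'

digests = {
    # One byte per ASCII character.
    "utf-8": md5_hex_text(xml_text, "utf-8"),
    # Two bytes per character, no byte order mark.
    "utf-16-le": md5_hex_text(xml_text, "utf-16-le"),
    # Python's "utf-16" codec prepends a byte order mark when encoding.
    "utf-16": md5_hex_text(xml_text, "utf-16"),
}

for name, digest in digests.items():
    print(f"{name:10s} {digest}")
```

The text looks identical in a diff tool in all three cases, yet each encoding produces a different digest. One way to avoid the mismatch, under the assumption that both sides have access to the file as bytes, is to hash the exact bytes that go over the socket (the equivalent of `md5_hex_bytes` above) rather than a string that each side decodes and re-encodes on its own.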