Unix md5sum vs PowerShell Get-Hash

I am trying to generate an MD5 hash from PowerShell. I installed the PowerShell Community Extensions (PSCX) to get the Get-Hash command.

However, when I generate an MD5 hash using Get-Hash, it doesn't match the hash generated by md5sum on an Ubuntu machine.

PowerShell:

PS U:\> "hello world" | get-hash -Algorithm MD5

Path Algorithm HashString                       Hash
---- --------- ----------                       ----
     MD5       E42B054623B3799CB71F0883900F2764 {228, 43, 5, 70...}

Ubuntu:

root@LT-A03433:~# echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4  -

I know that the one generated by Ubuntu is correct as a couple of online sites show the same result.

What am I doing wrong with PowerShell Get-Hash?

Solution

The difference is not obvious, but you are not hashing the same data. MD5 operates on bytes and has no notion of text encoding, which is why you can hash binary data just as easily as text. With that in mind, we can look at exactly which bytes (strictly, octets: a stream of 8-bit values) MD5 is hashing in each case. For this, we can use xxd or any other hex editor.

First, your Ubuntu example:

$ echo "hello world" | xxd
0000000: 6865 6c6c 6f20 776f 726c 640a            hello world.

Note the 0a at the end, a Unix-style newline, displayed as . in the right-hand column. By default echo appends a newline to what it prints. You could use printf to avoid it, but that would produce a different hash. With the newline included, the hash matches yours:

$ echo "hello world" | md5
6f5902ac237024bdd0c176cb93063dc4
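To make the effect of the trailing newline concrete, here is a small sketch that hashes the same text both ways. It assumes GNU md5sum (as on the Ubuntu machine in the question) rather than the md5 command used above, and uses cut only to strip the trailing file-name column; the variable names are just for illustration.

```shell
# Hash "hello world" with the newline that echo appends...
with_newline=$(echo "hello world" | md5sum | cut -d' ' -f1)

# ...and without it, using printf, which prints exactly what it is given.
without_newline=$(printf '%s' "hello world" | md5sum | cut -d' ' -f1)

echo "echo:   $with_newline"
echo "printf: $without_newline"
```

The first value should be the 6f59... hash from the question; the second is different, even though the visible text is identical.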

Now let's consider what PowerShell is doing. It passes one of its own strings directly to the Get-Hash cmdlet. As it turns out, the native representation of string data on Windows differs from Unix: Windows uses wide strings, where each character is represented (in memory) as two bytes. To reproduce this, we can open a text editor and paste in:

hello world

with no trailing newline, and save it as test.txt, encoded as UTF-16 little-endian (without a byte order mark). If we examine the actual bytes this produces, we see the difference:

$ xxd < test.txt
0000000: 6800 6500 6c00 6c00 6f00 2000 7700 6f00  h.e.l.l.o. .w.o.
0000010: 7200 6c00 6400                           r.l.d.

Each character now takes two bytes, with the second byte being 00. This is expected for basic ASCII characters, since their Unicode code points are the same as their ASCII values, so the high byte is always zero. (That wasted byte is one reason UTF-8 is used across the Internet instead of UTF-16.) Now let's see the hash:

$ md5 < test.txt
e42b054623b3799cb71f0883900f2764

Which matches what PS is producing for you.
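The text-editor step can also be reproduced on the command line. This is a sketch under the assumption that iconv and GNU md5sum are available; od stands in for xxd here only because it ships with coreutils, and the variable name is illustrative.

```shell
# Re-encode the text as UTF-16LE (no BOM, no trailing newline) and dump the bytes.
printf '%s' "hello world" | iconv -f UTF-8 -t UTF-16LE | od -An -tx1

# Hash exactly those bytes.
utf16_hash=$(printf '%s' "hello world" | iconv -f UTF-8 -t UTF-16LE | md5sum | cut -d' ' -f1)
echo "$utf16_hash"
```

The dump should show the same 68 00 65 00 ... pattern as the xxd output above, and the hash should be the e42b... value that Get-Hash reported.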

So, to answer your question: you're not doing anything wrong. You just need to hash the same bytes to get the same hash, which means encoding your string the same way (UTF-8 rather than UTF-16) and including the same trailing newline that echo adds. Unfortunately I don't have access to PS, but this should be a step in the right direction: UTF8Encoding class.