
Calculate md5 on a single 1T file, or on 100 10G files: which one is faster? Or is the speed the same?


I have a huge 1T file on my local machine and one on the remote server. I need to calculate their md5 sums to check whether they are exactly the same. Since it will take a long time to calculate md5 over them, I want to do some research on md5 speed first. I can calculate md5 directly against the whole file, or split it into 100 10G files and calculate md5 on each of them. I want to know which one is faster, or whether they will have the same speed.
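Since the end goal is just an equality check between the local and remote copies, the digests only need to be computed once on each side and compared. A minimal sketch of that comparison (filenames are illustrative stand-ins; in the real setup the second digest would come from the remote host, e.g. via `ssh remote md5sum /path/to/file`):

```shell
# Two local files stand in for the local and remote copies here.
printf 'some large payload' > local.bin
cp local.bin remote.bin                  # pretend this is the remote copy

sum_local=$(md5sum local.bin | awk '{print $1}')
sum_remote=$(md5sum remote.bin | awk '{print $1}')

if [ "$sum_local" = "$sum_remote" ]; then
    echo "identical"
else
    echo "differ"
fi
```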


Solution

  • As I was trying to say in the comments, it will depend on lots of things: the speed of your disk subsystem, your CPU performance, and so on.

    Here is an example. Create a 120GB file and check its size:

    dd if=/dev/random of=junk bs=1g count=120    # BSD/macOS dd syntax; GNU dd wants bs=1G
    
    ls -lh junk
    -rw-r--r--  1 mark  staff   120G  5 Oct 13:34 junk
    

    Checksum in one go:

    time md5sum junk
    3c8fb0d5397be5a8b996239f1f5ce2f0  junk
    
    real    3m55.713s       <--- 4 minutes
    user    3m28.441s
    sys     0m24.871s
    

    Checksum in 10GB chunks, with 12 CPU cores in parallel:

    time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
    29010b411a251ff467a325bfbb665b0d  -
    793f02bb52407415b2bfb752827e3845  -
    bf8f724d63f972251c2973c5bc73b68f  -
    d227dcb00f981012527fdfe12b0a9e0e  -
    5d16440053f78a56f6233b1a6849bb8a  -
    dacb9fb1ef2b564e9f6373a4c2a90219  -
    ba40d6e7d6a32e03fabb61bb0d21843a  -
    5a5ee62d91266d9a02a37b59c3e2d581  -
    95463c030b73c61d8d4f0e9c5be645de  -
    4bcd7d43849b65d98d9619df27c37679  -
    92bc1f80d35596191d915af907f4d951  -
    44f3cb8a0196ce37c323e8c6215c7771  -
    
    real    1m0.046s      <--- 1 minute
    user    4m51.073s
    sys     3m51.335s
    

    It takes about a quarter of the time on my machine, but your mileage will vary depending on your disk subsystem, your CPU, and so on.
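One caveat with the chunked run: the per-chunk digests are not the same as the whole-file digest, so to compare the two 1T files this way you would need to produce the chunk list identically on both machines (same chunk size, same order) and diff the lists. A rough sketch of that idea using GNU `split --filter` instead of `parallel`, demonstrated on a small file with 1M chunks where the real case would use 10G (filenames and sizes are made up):

```shell
set -e
dd if=/dev/urandom of=big.bin bs=1M count=4 2>/dev/null
cp big.bin copy.bin                     # stand-in for the remote file

chunked_md5() {
    # GNU split: stream each 1 MiB chunk into md5sum instead of writing
    # chunk files to disk; prints one "digest  -" line per chunk, in order.
    split -b 1M --filter='md5sum' "$1"
}

chunked_md5 big.bin  > local.md5
chunked_md5 copy.bin > remote.md5
diff local.md5 remote.md5 && echo "files match"
```

Running the same command with the same chunk size on both hosts and diffing the two lists also tells you roughly *where* the files differ, which a single whole-file digest cannot.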