I need to calculate SHA-256 checksums for files over 1 GB, reading each file in chunks. Currently I am using Python with this:
import hashlib
import time

start_time = time.time()


def sha256sum(filename="big.txt", block_size=2 ** 13):
    sha = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Read block_size bytes at a time until f.read() returns b'' (EOF)
        for chunk in iter(lambda: f.read(block_size), b''):
            sha.update(chunk)
    return sha.hexdigest()


input_file = '/tmp/1GB.raw'
print('checksum is: %s\n' % sha256sum(input_file))
print('Elapsed time: %s' % str(time.time() - start_time))
I wanted to try Go, thinking I would get faster results, but the following code runs a couple of seconds slower:
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"math"
	"os"
	"time"
)

const fileChunk = 8192

func File(file string) string {
	fh, err := os.Open(file)
	if err != nil {
		panic(err.Error())
	}
	defer fh.Close()
	stat, _ := fh.Stat()
	size := stat.Size()
	chunks := uint64(math.Ceil(float64(size) / float64(fileChunk)))
	h := sha256.New()
	for i := uint64(0); i < chunks; i++ {
		csize := int(math.Min(fileChunk, float64(size-int64(i*fileChunk))))
		buf := make([]byte, csize)
		fh.Read(buf)
		io.WriteString(h, string(buf))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	start := time.Now()
	fmt.Printf("checksum is: %s\n", File("/tmp/1G.raw"))
	elapsed := time.Since(start)
	fmt.Printf("Elapsed time: %s\n", elapsed)
}
Any idea how to improve the Go code, if that is possible? Maybe by using several CPU cores, one for reading and another for hashing?
As suggested, I am now using this code:
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"time"
)

func main() {
	start := time.Now()
	fh, err := os.Open("/tmp/1GB.raw")
	if err != nil {
		panic(err.Error())
	}
	defer fh.Close()
	h := sha256.New()
	_, err = io.Copy(h, fh)
	if err != nil {
		panic(err.Error())
	}
	fmt.Println(hex.EncodeToString(h.Sum(nil)))
	fmt.Printf("Elapsed time: %s\n", time.Since(start))
}
For testing I am creating the 1GB file with this:
# mkfile 1G /tmp/1GB.raw
The new version is faster, but not by much. What about using channels? Could using more than one CPU core help? I was expecting an improvement of at least 20%, but unfortunately I am getting almost no gain.
time result for Python:
5.867u 0.250s 0:06.15 99.3% 0+0k 0+0io 0pf+0w
time result for Go after compiling (go build) and running the binary:
5.687u 0.198s 0:05.93 98.9% 0+0k 0+0io 0pf+0w
Any more ideas?
Using the version with channels posted below in the accepted answer by @icza:
Elapsed time: 5.894779733s
Using the version with no channels:
Elapsed time: 5.823489239s
I thought that using channels would speed things up a little, but it seems not.
I am running this on a MacBook Pro with OS X Yosemite, using this Go version:
go version go1.4.1 darwin/amd64
Setting runtime.GOMAXPROCS to 4:
runtime.GOMAXPROCS(4)
Made things faster:
Elapsed time: 5.741511748s
Changing the chunk size to 8192<<10 bytes (8 MiB; note that the Python version reads 8192 bytes at a time) gives the expected result:
...
for b, hasMore := make([]byte, 8192<<10), true; hasMore; {
...
Also, I am now using only runtime.GOMAXPROCS(2).
Your solution is quite inefficient: you allocate a new buffer in each iteration, use it once, and throw it away.
You also convert the content of your buffer (buf) to a string and write that string to the SHA-256 calculator, which converts it back to bytes: a completely unnecessary round trip.
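To make the two points concrete, here is a minimal sketch of a chunked loop that allocates one buffer up front and writes the bytes straight into the hash. The names sumChunked and sumChunkedString, the buffer size parameter, and the "abc" demo input are illustrative choices, not part of the original code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// sumChunked (hypothetical helper) hashes r with a single reusable buffer.
// bufSize must be positive.
func sumChunked(r io.Reader, bufSize int) (string, error) {
	h := sha256.New()
	buf := make([]byte, bufSize) // allocated once, reused for every read
	for {
		n, err := r.Read(buf)
		if n > 0 {
			h.Write(buf[:n]) // bytes go in directly; hash.Hash.Write never returns an error
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sumChunkedString is a convenience wrapper for in-memory input.
func sumChunkedString(s string, bufSize int) (string, error) {
	return sumChunked(strings.NewReader(s), bufSize)
}

func main() {
	// A 2-byte buffer forces several reads even for this tiny input.
	sum, err := sumChunkedString("abc", 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```

For a file, pass the *os.File returned by os.Open as r; the loop itself does not change.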
Here is another, quite fast solution; test it for performance:
fh, err := os.Open(file)
if err != nil {
	panic(err.Error())
}
defer fh.Close()
h := sha256.New()
_, err = io.Copy(h, fh)
if err != nil {
	panic(err.Error())
}
fmt.Println(hex.EncodeToString(h.Sum(nil)))
A little explanation:
io.Copy() is a function that reads all the data (until EOF is reached) from a Reader and writes it to the specified Writer. Since the SHA-256 calculator (hash.Hash) implements Writer and the File (or rather *File) implements Reader, this is as easy as it can be.
Once all the data has been written to the hash, hex.EncodeToString() simply converts the result (obtained by hash.Sum(nil)) to a human-readable hex string.
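Because only the Reader and Writer interfaces are involved, the same few lines work for any data source. The sketch below wraps them in a helper and feeds it an in-memory reader instead of a file; sumReader and sumString are hypothetical names chosen for the illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// sumReader (hypothetical helper) streams everything from r into a SHA-256
// hash with io.Copy and returns the digest as a hex string.
func sumReader(r io.Reader) (string, error) {
	h := sha256.New() // hash.Hash is an io.Writer
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sumString hashes an in-memory string through the same code path.
func sumString(s string) (string, error) {
	return sumReader(strings.NewReader(s))
}

func main() {
	sum, err := sumString("abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```

Passing an opened *os.File to sumReader gives the file version shown above.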
The program reads 1 GB of data from the hard disk and calculates its SHA-256 hash. Since reading from the hard disk is a relatively slow operation, the performance gain of the Go version will not be significant compared to the Python solution. The overall run takes a couple of seconds, which is in the same order of magnitude as the time required to read 1 GB of data from the hard disk. Since both the Go and the Python solutions require approximately the same amount of time to read the data from the disk, you won't see much difference in the results.
There is a slight margin for improvement: read a chunk of the file into one buffer, start calculating its SHA-256 hash, and at the same time read the next chunk of the file into a second buffer. Once that is done, send the second buffer to the SHA-256 calculator and at the same time read the next chunk into the first buffer.
But since reading the data from the disk takes more time than calculating its SHA-256 digest (or updating the state of the digest calculator), you won't see significant improvement. The performance bottleneck in your case will always be the time required to read the data into memory.
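How the time splits between reading and hashing depends on the machine, so it can be worth measuring. One rough way, sketched below, is to time SHA-256 on data that is already in memory and compare that with the time for hashing the file. The helper hashZeros, the 256 MiB demo size, and the 8192-byte chunk size are assumptions made for this illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// hashZeros (hypothetical helper) hashes totalBytes zero bytes from memory,
// chunkSize bytes at a time, and reports how long the hashing alone took.
// chunkSize must be positive.
func hashZeros(totalBytes, chunkSize int) (time.Duration, string) {
	chunk := make([]byte, chunkSize)
	h := sha256.New()
	start := time.Now()
	for done := 0; done < totalBytes; done += len(chunk) {
		if rest := totalBytes - done; rest < len(chunk) {
			chunk = chunk[:rest] // last, shorter chunk
		}
		h.Write(chunk)
	}
	elapsed := time.Since(start)
	return elapsed, hex.EncodeToString(h.Sum(nil))
}

func main() {
	const total = 256 << 20 // 256 MiB, an arbitrary size for the demo
	elapsed, sum := hashZeros(total, 8192)
	fmt.Println("digest:", sum)
	fmt.Printf("hashed %d MiB in %s (%.0f MiB/s)\n",
		total>>20, elapsed, float64(total>>20)/elapsed.Seconds())
}
```

If the in-memory figure, scaled to the file size, is close to the time for hashing the file, overlapping reads with hashing has little left to hide.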
Here is a complete, runnable solution using 2 goroutines: while one goroutine reads a chunk of the file, the other hashes a previously read chunk. When a goroutine finishes reading, it continues with hashing and allows the other to read in parallel.
Proper synchronization between the phases (reading, hashing) is done with channels. As suspected, the performance gain is just a little over 4% in time (may vary based on CPU and hard disk speed) because the hashing computation is negligible compared to the disk reading time. The performance gain will most likely be higher if the reading speed of the hard disk is greater (test it on SSD).
So the complete program:
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
	"io"
	"os"
	"runtime"
	"time"
)

const file = "t:/1GB.raw"

func main() {
	runtime.GOMAXPROCS(2) // Important as Go 1.4 uses only 1 by default!
	start := time.Now()
	f, err := os.Open(file)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	h := sha256.New()
	// 2 channels: used to give green light for reading into buffer b1 or b2
	readch1, readch2 := make(chan int, 1), make(chan int, 1)
	// 2 channels: used to give green light for hashing the content of b1 or b2
	hashch1, hashch2 := make(chan int, 1), make(chan int, 1)
	// Start signal: Allow b1 to be read and hashed
	readch1 <- 1
	hashch1 <- 1
	go hashHelper(f, h, readch1, readch2, hashch1, hashch2)
	hashHelper(f, h, readch2, readch1, hashch2, hashch1)
	fmt.Println(hex.EncodeToString(h.Sum(nil)))
	fmt.Printf("Elapsed time: %s\n", time.Since(start))
}

func hashHelper(f *os.File, h hash.Hash, mayRead <-chan int, readDone chan<- int, mayHash <-chan int, hashDone chan<- int) {
	for b, hasMore := make([]byte, 64<<10), true; hasMore; {
		<-mayRead
		n, err := f.Read(b)
		if err != nil {
			if err == io.EOF {
				hasMore = false
			} else {
				panic(err)
			}
		}
		readDone <- 1
		<-mayHash
		_, err = h.Write(b[:n])
		if err != nil {
			panic(err)
		}
		hashDone <- 1
	}
}
Notes:
In my solution I used only 2 goroutines. There is no point in using more because, as noted before, the disk reading speed is the bottleneck, and it is already fully used: with 2 goroutines, one of them can be reading at any time.
Notes on synchronization: the 2 goroutines run in parallel. Each goroutine may use its local buffer b at any time. Access to the shared File and to the shared Hash is synchronized by the channels: only 1 goroutine may use the Hash at any given time, and only 1 goroutine may read from the File at any given time.
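The same overlap of reading and hashing can also be arranged differently. The sketch below is an alternative to the program above, not the same method: one goroutine only reads, the caller only hashes, and two buffers travel between them over channels. The names sumPipelined and sumPipelinedString and the demo input are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// sumPipelined (hypothetical helper) reads r in a separate goroutine while the
// caller hashes, so reading one chunk can overlap with hashing the previous one.
// bufSize must be positive.
func sumPipelined(r io.Reader, bufSize int) (string, error) {
	free := make(chan []byte, 2)   // buffers ready to be filled
	filled := make(chan []byte, 2) // buffers ready to be hashed
	errc := make(chan error, 1)    // carries a non-EOF read error, if any
	free <- make([]byte, bufSize)
	free <- make([]byte, bufSize)

	go func() {
		defer close(filled) // tells the hashing loop that reading is over
		for buf := range free {
			n, err := r.Read(buf[:cap(buf)])
			filled <- buf[:n] // may be empty; hashing zero bytes is harmless
			if err != nil {
				if err != io.EOF {
					errc <- err
				}
				return
			}
		}
	}()

	h := sha256.New()
	for buf := range filled {
		h.Write(buf)
		free <- buf // hand the buffer back; never blocks, only 2 buffers exist
	}
	select {
	case err := <-errc:
		return "", err
	default:
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sumPipelinedString is a convenience wrapper for in-memory input.
func sumPipelinedString(s string, bufSize int) (string, error) {
	return sumPipelined(strings.NewReader(s), bufSize)
}

func main() {
	sum, err := sumPipelinedString("abc", 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```

Only the reading goroutine touches the Reader and only the caller touches the Hash, so the channels need to guard nothing except the hand-over of each buffer.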