I want to obtain file information (file name and size in bytes) for the files in a directory, but there are a lot of subdirectories (~1,000) and files (~40,000).
My current solution is to use filepath.Walk() to obtain the information for each file, but this takes quite a long time.
func visit(path string, f os.FileInfo, err error) error {
    if err != nil {
        return nil // f may be nil when err is set, so skip this entry
    }
    if f.Mode().IsRegular() {
        fmt.Printf("Visited: %s File name: %s Size: %d bytes\n", path, f.Name(), f.Size())
    }
    return nil
}

func main() {
    flag.Parse()
    root := "C:/Users/HERNOUX-06523/go/src/boilerpipe" //flag.Arg(0)
    filepath.Walk(root, visit)
}
Is it possible to do parallel/concurrent processing using filepath.Walk()?
You can process the tree concurrently by modifying your visit() function so that it does not descend into subfolders, but instead launches a new goroutine for each subfolder.
To do that, return the special filepath.SkipDir error from your visit() function if the entry is a directory. Don't forget to check whether the path passed to visit() is the folder the current goroutine is supposed to process, because that folder is also passed to visit(), and without this check you would launch goroutines endlessly for the initial folder.
You will also need some kind of "counter" of how many goroutines are still working in the background; for that you can use sync.WaitGroup.
Here's a simple implementation of this:
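package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

// wg counts the walkDir goroutines that are still running.
var wg sync.WaitGroup

// walkDir walks dir itself and launches a new goroutine for each subfolder.
func walkDir(dir string) {
    defer wg.Done()

    visit := func(path string, f os.FileInfo, err error) error {
        if err != nil {
            return nil // f may be nil when err is set, so skip this entry
        }
        if f.IsDir() && path != dir {
            wg.Add(1)
            go walkDir(path)
            return filepath.SkipDir // the new goroutine handles this subfolder
        }
        if f.Mode().IsRegular() {
            fmt.Printf("Visited: %s File name: %s Size: %d bytes\n",
                path, f.Name(), f.Size())
        }
        return nil
    }

    filepath.Walk(dir, visit)
}

func main() {
    flag.Parse()
    root := "folder/to/walk" //flag.Arg(0)

    wg.Add(1)
    walkDir(root)
    wg.Wait()
}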
Some notes:
Depending on how the files are distributed among subfolders, this may not fully utilize your CPU or storage: if, for example, 99% of all the files are in one subfolder, the goroutine walking that subfolder will still take the majority of the time.
Also note that fmt.Printf() calls are serialized, so that will also slow down the process. I assume this was just an example, and in reality you will do some kind of processing / statistics in memory. Don't forget to also protect concurrent access to variables accessed from your visit() function.
Don't worry about the high number of subfolders. It is normal and the Go runtime is capable of handling even hundreds of thousands of goroutines.
Also note that most likely the performance bottleneck will be your storage / hard disk speed, so you may not gain the performance you wish. After a certain point (your hard disk limit), you won't be able to improve performance.
Also, launching a new goroutine for each subfolder may not be optimal; you may get better performance by limiting the number of goroutines walking your folders. For that, check out and use a worker pool: