I'd like to know the fastest way to get a folder/directory size (in bytes, as I'll convert it to another unit later) in Python on a Windows PC. I am currently using Python 3.8. I have tried the following function to obtain it:
    import os

    def get_dir_size(directory):
        total = 0
        try:
            for entry in os.scandir(directory):
                if entry.is_file():
                    total += os.path.getsize(entry.path)
                elif entry.is_dir():
                    # Recurse into the subfolder (the call must match this function's name)
                    total += get_dir_size(entry.path)
        except NotADirectoryError:
            return os.path.getsize(directory)
        return total
However, I realised that the above code takes too long to run: I have quite a few folders to measure, and a large folder with many levels of subfolders takes a long time to total up.
So I have been trying other approaches and have hit a roadblock with the current one. Here I tried to use subprocess.run to run a PowerShell command and obtain the size. This is the code I wrote:
    import subprocess

    folder_path = r"C:\Users\username\Downloads\test_folder"  # Suppose this path is a legitimate path
    command = fr"Get-ChildItem -Path {folder_path} -Recurse -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum | Select-Object Sum, Count"
    proc = subprocess.run(command, shell=True, stdout=subprocess.PIPE)
    print(proc.stdout)
The final output I get when I print proc.stdout is empty, and an error is printed saying that the command I gave "is not recognized as an internal or external command, operable program or batch file".
I would appreciate some advice on how to make this work or, if possible, suggestions on how to get the folder size in less time. My current working approach takes many hours, and the network connection might drop in the meantime.
Thank you.
I recently encountered the same problem. There may be better solutions, but here's how I dealt with it. First, note that the subprocess module documentation states:
"On Windows with shell=True, the COMSPEC environment variable specifies the default shell."
Typically, this variable is set to cmd.exe by default, which means that any command passed to subprocess.run() with shell=True is executed by the command prompt. That is why, when you tried to run the PowerShell cmdlet Get-ChildItem, cmd was unable to recognize it, which led to that error message.
So to execute your PowerShell command from cmd, call the PowerShell process with the -Command parameter, as illustrated below:

    command = f"powershell -command \"(Get-ChildItem -Force -Path '{folder_path}' -Recurse -ErrorAction SilentlyContinue | measure -Property Length -Sum).sum\""
Note the additional -Force parameter, which lets the cmdlet include items it would otherwise skip, such as hidden or system files. It does not override security restrictions, so folders you lack permission to read are still skipped.
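Since the error came from cmd sitting between Python and PowerShell, another option is to skip shell=True and start PowerShell directly with an argument list. The following is only a minimal sketch of that idea, not something I benchmarked: the helper names build_powershell_args and powershell_dir_size are made up for illustration, and I have assumed -NoProfile (to skip loading the user profile) and -LiteralPath (so that characters such as [ ] in the path are not treated as wildcards) in place of the plain -Path used above.

```python
import subprocess


def build_powershell_args(folder_path):
    """Build the argument list for a direct powershell.exe call (hypothetical helper)."""
    # Inside a PowerShell single-quoted string, a literal ' is written as ''.
    quoted = folder_path.replace("'", "''")
    script = (
        f"(Get-ChildItem -Force -LiteralPath '{quoted}' -Recurse "
        "-ErrorAction SilentlyContinue | "
        "Measure-Object -Property Length -Sum).Sum"
    )
    # No shell=True is needed: powershell.exe is started directly, so cmd.exe
    # never sees (or tries to interpret) the cmdlet names.
    return ["powershell", "-NoProfile", "-Command", script]


def powershell_dir_size(folder_path):
    """Return the folder size in bytes, or None if no number was printed."""
    proc = subprocess.run(
        build_powershell_args(folder_path),
        capture_output=True,
        text=True,
    )
    output = proc.stdout.strip()
    return int(output) if output.isdigit() else None
```

On a Windows machine it would be called as powershell_dir_size(r"C:\Users\username\Downloads\test_folder"). Passing a list also avoids having to escape the double quotes around the whole PowerShell expression.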
However, after I ran a speed test, I found that this method is slower than the os.walk() method for folders that contain a relatively small number of files. This may be due to the time PowerShell needs to start up in the background before it executes the command (although I am not certain about this). The speed test compared three methods: the PowerShell Get-ChildItem command (childitem_method), os.walk() (walk_method), and the cmd dir command (dir_method).
Here are the results of my benchmarking:
| Folder size (items) | childitem_method | walk_method | dir_method |
|---|---|---|---|
| 1 GB (2 items) | 1.6201 sec | 0.0013 sec | 0.1968 sec |
| 6 GB (60 items) | 1.6853 sec | 0.0179 sec | 0.2020 sec |
| 3 GB (1000 items) | 1.9482 sec | 0.2969 sec | 0.2832 sec |
| 10 GB (10000 items) | 4.5680 sec | 2.8583 sec | 2.0390 sec |
| 27 GB (15000 items) | 6.1768 sec | 4.7842 sec | 2.8775 sec |
| 77 GB (670000 items) | 6 min 30 sec | 8 min 17 sec | 3 min 37 sec |
Here is the implementation of the three methods I used:
    import os
    import subprocess
    import re
    import timeit


    def childitem_method(path):
        # Sum file sizes with PowerShell's Get-ChildItem, launched through cmd
        if os.path.isdir(path):
            command = f"powershell -command \"(Get-ChildItem -Force -Path '{path}' -Recurse -ErrorAction SilentlyContinue | measure -Property Length -Sum).sum\""
            output = (
                subprocess.check_output(command, shell=True)
                .decode("utf-8", errors="ignore")
                .strip()
            )
            return int(output) if output.isnumeric() else None


    def walk_method(path):
        # Pure Python: add up the size of every file that os.walk() yields
        if os.path.isdir(path):
            return sum(
                sum(
                    os.path.getsize(os.path.join(walk_result[0], element))
                    for element in walk_result[2]
                )
                for walk_result in os.walk(path)
            )


    def dir_method(path):
        # cmd's dir: /s recurses, /a includes hidden and system files,
        # /-C drops the thousands separators from the sizes
        if os.path.isdir(path):
            command = f'chcp 65001 > nul && dir "{path}" /s /a /-C'
            output = (
                subprocess.check_output(command, shell=True)
                .decode("utf-8", errors="ignore")
                .strip()
                .rsplit("\n", 2)[1]  # second-to-last line: total file count and bytes
            )
            output = re.findall(r"\d+", output)[-1]  # last number on that line is the byte total
            return int(output) if output.isnumeric() else None


    if __name__ == "__main__":
        folder_path, repetitions = r"C:\ ".strip(), 100

        print("childitem_method:",
              timeit.timeit(lambda: childitem_method(folder_path), number=repetitions)/repetitions)
        print("walk_method:     ",
              timeit.timeit(lambda: walk_method(folder_path), number=repetitions)/repetitions)
        print("dir_method:      ",
              timeit.timeit(lambda: dir_method(folder_path), number=repetitions)/repetitions)
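One more pure-Python candidate that is not part of the timings above is an os.scandir() loop that reads the size from entry.stat() instead of calling os.path.getsize() on each path. According to the Python documentation, on Windows DirEntry.stat() only needs an extra system call for symbolic links, so this may reduce per-file overhead. Below is a rough sketch; the name scandir_size is my own, and I am assuming that symlinks should not be followed and that unreadable entries can simply be skipped.

```python
import os


def scandir_size(path):
    """Return the total size in bytes of the regular files under path.

    Symbolic links are not followed, and entries that raise OSError
    (for example, permission denied) are skipped silently.
    """
    total = 0
    stack = [path]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            # Size comes from the DirEntry rather than a
                            # separate os.path.getsize() call on the path.
                            total += entry.stat(follow_symlinks=False).st_size
                    except OSError:
                        continue  # skip entries that cannot be inspected
        except OSError:
            continue  # skip folders that cannot be listed
    return total
```

It can be timed in the same way as the others, e.g. timeit.timeit(lambda: scandir_size(folder_path), number=repetitions)/repetitions. Whether it beats dir_method on very large trees would need to be measured on your own data.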