My program does the following:

For each file:
1. read the file
2. sort the contents as a list and push the list to a master list
I did this without any async/await, and these are the time statistics:
real 0m0.036s
user 0m0.018s
sys 0m0.009s
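For reference, a minimal sketch of that synchronous approach, reconstructed from the steps above (the original non-async code is not shown):

import os

directory = "/tmp"
listOfLists = list()

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename)) as fin:
            # read the file into a list of integers
            numbersInList = [int(line.strip("\n"), 10) for line in fin]
        # sort the contents and push the list onto the master list
        numbersInList.sort()
        listOfLists.append(numbersInList)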
With the async/await code below, I get:
real 0m0.144s
user 0m0.116s
sys 0m0.029s
which, given the use case, suggests that I am using asyncio incorrectly.
Does anybody have an idea what I am doing wrong?
import asyncio
import aiofiles
import os

directory = "/tmp"
listOfLists = list()

async def sortingFiles(numbersInList):
    numbersInList.sort()

async def awaitProcessFiles(filename, numbersInList):
    await readFromFile(filename, numbersInList)
    await sortingFiles(numbersInList)
    await appendToList(numbersInList)

async def readFromFile(filename, numbersInList):
    # Append every line; the async context manager closes the file automatically.
    async with aiofiles.open(directory + "/" + filename, 'r') as fin:
        async for line in fin:
            numbersInList.append(int(line.strip("\n"), 10))

async def appendToList(numbersInList):
    listOfLists.append(numbersInList)

async def main():
    tasks = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            numbersInList = list()
            task = asyncio.ensure_future(awaitProcessFiles(filename, numbersInList))
            tasks.append(task)
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())
Profiling info:
151822 function calls (151048 primitive calls) in 0.239 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
11 0.050 0.005 0.050 0.005 {built-in method _imp.create_dynamic}
57 0.022 0.000 0.022 0.000 {method 'read' of '_io.BufferedReader' objects}
57 0.018 0.000 0.018 0.000 {built-in method io.open_code}
267 0.012 0.000 0.012 0.000 {method 'control' of 'select.kqueue' objects}
57 0.009 0.000 0.009 0.000 {built-in method marshal.loads}
273 0.009 0.000 0.009 0.000 {method 'recv' of '_socket.socket' objects}
265 0.005 0.000 0.098 0.000 base_events.py:1780(_run_once)
313 0.004 0.000 0.004 0.000 {built-in method posix.stat}
122 0.004 0.000 0.004 0.000 {method 'acquire' of '_thread.lock' objects}
203/202 0.003 0.000 0.011 0.000 {built-in method builtins.__build_class__}
1030 0.003 0.000 0.015 0.000 thread.py:158(submit)
1030 0.003 0.000 0.009 0.000 futures.py:338(_chain_future)
7473 0.003 0.000 0.003 0.000 {built-in method builtins.hasattr}
1030 0.002 0.000 0.017 0.000 futures.py:318(_copy_future_state)
36 0.002 0.000 0.002 0.000 {built-in method posix.getcwd}
3218 0.002 0.000 0.077 0.000 {method 'run' of 'Context' objects}
6196 0.002 0.000 0.003 0.000 threading.py:246(__enter__)
3218 0.002 0.000 0.078 0.000 events.py:79(_run)
6192 0.002 0.000 0.004 0.000 base_futures.py:13(isfuture)
1047 0.002 0.000 0.002 0.000 threading.py:222(__init__)
Make some test files...
import random, os

path = '<directory name here>'  # fill in a directory path

nlines = range(1000)
nfiles = range(1, 101)

for n in nfiles:
    fname = f'{n}.txt'
    with open(os.path.join(path, fname), 'w') as f:
        for _ in nlines:
            f.write(f'{random.randrange(1,10000)}\n')
asyncio makes little sense for local files. That is the reason even the Python standard library does not provide async file operations.
async for line in fin:
Consider the line above. The event loop pauses the coroutine for every line read and executes some other coroutine. That means the following lines of the file, already sitting in the CPU cache, are thrown away to make room for the next coroutine (they will still be in RAM, though).
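Note also that aiofiles does not use any special non-blocking disk I/O; it delegates each blocking file operation to a thread pool and awaits the result. A rough sketch of the mechanism (simplified for illustration, not the actual aiofiles source):

import asyncio

async def readline_async(f):
    # Roughly what aiofiles does for each operation: run the blocking
    # call in the default thread pool executor and suspend the coroutine
    # until it finishes. Every line read is therefore a full event-loop
    # round trip plus a thread hand-off.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, f.readline)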
When should aiofiles be used?
Suppose you already use async code in your program and occasionally have to do some file processing. If the file processing were done in the same event loop, all the other coroutines would be blocked. In that case you can either use aiofiles or do the processing in a different executor (see the sketch below).
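A minimal sketch of the executor approach, assuming Python 3.9+ for asyncio.to_thread (the helper read_and_sort is made up for illustration):

import asyncio

def read_and_sort(filename):
    # Plain blocking I/O; it runs in a worker thread, so the event
    # loop stays free to run the program's other coroutines.
    with open(filename) as f:
        numbers = [int(line) for line in f]
    numbers.sort()
    return numbers

async def process(filenames):
    # Off-load the blocking work to the default thread pool and
    # gather the sorted lists concurrently.
    return await asyncio.gather(
        *(asyncio.to_thread(read_and_sort, name) for name in filenames)
    )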
If all the program is doing is reading from files, it will be faster to do them sequentially so that it makes good use of the cache. Jumping from one file to another is like a thread context switch and should make it slower.