Tags: python, azure, azure-machine-learning-service, azure-notebooks

AzureML: listing a huge number of files


I have a directory in an AzureML notebook that contains 300k files, and I need to list their names. The approach below works but takes 1.5 hours to execute:

from os import listdir
from os.path import isfile, join
mypath = "./temp/"
docsOnDisk = [f for f in listdir(mypath) if isfile(join(mypath, f))]

What is the Azure way to list those files quickly? (Both the notebook and this directory are on a file share.)

I am also aware that the approach below will give some gain, but it is still not the Azure way to do this.

from os import scandir
docsOnDisk = [f.name for f in scandir(mypath)]  # should be 2-20x faster

Solution

  • Try using the glob module and the built-in filter function instead of a list comprehension.

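    import glob
    from os.path import isfile
    # Pattern matching every entry directly inside ./temp/
    mypath = "./temp/*"
    # glob returns full paths (e.g. "./temp/a.txt"), not bare names
    docsOnDisk = glob.glob(mypath)
    # Optional: keep only regular files (drops subdirectories)
    verified_docsOnDisk = list(filter(isfile, docsOnDisk))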
    

    glob returns only paths that exist, so there is no need to verify them with isfile() unless the directory may also contain subdirectories, which glob returns as well. If you still want to verify, you can use filter instead of a list comprehension. To skip verification, comment out the last line.
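
    If the file check is still wanted, the scandir approach from the question can do it without a separate isfile() call per path. The following is only a sketch, not an Azure-specific API: the helper name list_file_names is hypothetical, and whether it is actually faster on a mounted file share is an assumption that depends on whether the mount reports the entry type during the directory scan.

```python
import os

def list_file_names(path):
    """Return the names of regular files directly inside `path`.

    Hypothetical helper for illustration. DirEntry.is_file() can often
    answer from data collected during the directory scan, so it may
    avoid one extra stat call per entry; on some filesystems it still
    falls back to a stat call.
    """
    with os.scandir(path) as entries:
        return [entry.name for entry in entries if entry.is_file()]

# Usage (assumes the directory exists):
# docsOnDisk = list_file_names("./temp/")
```

    Unlike glob, this returns bare file names rather than paths, which matches the output of the listdir version in the question.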