Tags: python, file, apache-spark, pyspark, rdd

Get filename and file modification/creation time as (key, value) pairs into an RDD using pyspark


I have folders containing a great many files (e.g. over 100k); some are small (less than 1 KB) and some are large (e.g. several MB).

I would like to use pyspark to scan all the files under these folders, e.g. "C:\Xiang". For example, the files in the first folder are:

C:\Xiang\fold1\filename1.txt
C:\Xiang\fold1\filename2.txt
C:\Xiang\fold1\filename3.txt
C:\Xiang\fold1\filename1_.meta.txt
C:\Xiang\fold1\filename2_.meta.txt
...

"fold2", "fold3", ... have similarly structure.

I would like to scan all the files under these folders and get the modification time of each file. Ideally, this would be saved into an RDD of (key, value) pairs, with the key being the filename (e.g. C:\Xiang\fold1\filename1.txt) and the value the modification time (e.g. 2020-12-16 13:40), so that I can perform further operations on these files, e.g. filter by modification time and open the selected files.

Any idea?


Solution

  • Use pathlib to list the files recursively, then map each full path to its last modified time in an RDD:

    import pathlib
    
    # Recursively collect the full paths of all files under the root folder
    root = pathlib.Path(r"C:\Xiang")  # raw string; forward slashes also work
    files = [str(p) for p in root.rglob("*") if p.is_file()]
    
    # Pair each filename with its last modification time (seconds since the epoch)
    rdd = sc.parallelize(files)
    rdd2 = rdd.map(lambda f: (f, pathlib.Path(f).stat().st_mtime))
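    
    With the pairs in place, the follow-up steps from the question (filter by modification time, then open the selected files) are straightforward. A minimal sketch, assuming an illustrative cutoff date; since st_mtime is seconds since the epoch, datetime handles the conversion to a readable timestamp:
    
    from datetime import datetime
    
    cutoff = datetime(2020, 12, 16).timestamp()  # illustrative cutoff date
    
    # Keep files modified on or after the cutoff, with readable timestamps
    recent = (rdd2
              .filter(lambda kv: kv[1] >= cutoff)
              .mapValues(lambda m: datetime.fromtimestamp(m).strftime("%Y-%m-%d %H:%M")))
    
    recent.collect()  # e.g. [('C:\\Xiang\\fold1\\filename1.txt', '2020-12-16 13:40')]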
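  • Alternatively, on Spark 3.0+ the binaryFile data source exposes a modificationTime column directly, so Spark does the recursive scan for you. A sketch, assuming an active spark session; as far as I know the heavy content column is only read if you select it, but verify that on 100k files before relying on it:
    
    df = (spark.read.format("binaryFile")
          .option("recursiveFileLookup", "true")  # descend into fold1, fold2, ...
          .option("pathGlobFilter", "*.txt")      # optional: restrict to .txt files
          .load("C:/Xiang"))  # or "file:///C:/Xiang", depending on your default filesystem
    
    rdd3 = df.select("path", "modificationTime").rdd.map(tuple)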