I have folders containing a large number of files (e.g. over 100k), some small (less than 1 KB) and some big (e.g. several MB).
I would like to use PySpark to scan all the files under these folders, e.g. "C:\Xiang". The file names in folder 1 are, for example:
C:\Xiang\fold1\filename1.txt
C:\Xiang\fold1\filename2.txt
C:\Xiang\fold1\filename3.txt
C:\Xiang\fold1\filename1_.meta.txt
C:\Xiang\fold1\filename2_.meta.txt
...
"fold2", "fold3", ... have similarly structure.
I would like to scan all the files under these folders and get the modification time of each file. Ideally, it would be saved into an RDD of (key, value) pairs, with the key being the filename (e.g. C:\Xiang\fold1\filename1.txt) and the value the modification time (e.g. 2020-12-16 13:40), so that I can perform further operations on these files, e.g. filter by modification time and open the selected files. ...
Any idea?
Use pathlib to walk the folder tree, get each file's last modified time, and map it onto an RDD of file paths:
import pathlib

# Recursively collect the full path of every file under the root folder;
# the raw string avoids backslash-escape issues in Windows paths
paths = [str(p) for p in pathlib.Path(r"C:\Xiang").rglob("*") if p.is_file()]
rdd = sc.parallelize(paths)
# Pair each path with its last-modified time (seconds since the epoch)
rdd2 = rdd.map(lambda f: (f, pathlib.Path(f).stat().st_mtime))
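Since st_mtime is a plain Unix timestamp, the filtering and file-opening steps from the question can follow directly. Here is a minimal sketch, assuming the same sc session as above; the cutoff value and the use of datetime.fromtimestamp and Path.read_text are illustrative choices, not part of the answer itself:

from datetime import datetime
import pathlib

# Convert the epoch timestamp to a datetime for readable comparisons
rdd3 = rdd2.mapValues(datetime.fromtimestamp)

# Keep only files modified after a hypothetical cutoff
cutoff = datetime(2020, 12, 16, 13, 40)
recent = rdd3.filter(lambda kv: kv[1] > cutoff)

# Open the selected files on the executors (assumes text content)
contents = recent.map(lambda kv: (kv[0], pathlib.Path(kv[0]).read_text()))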