python pyspark pyspark-pandas

Read the latest file for each month (grouped by monthYear) from a directory in PySpark


I have multiple files in a directory. The file names are similar to those shown in picture 1.

I want to read only the latest file for each month from the directory into a PySpark DataFrame. The files I expect to be read are shown in picture 2.


Solution

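  • import glob
    import os

    path = '/your_path/'
    form = 'csv'

    # List the matching files without changing the working directory.
    pattern = os.path.join(path, '*.{}'.format(form))
    files_list = [os.path.basename(p) for p in glob.glob(pattern)]

    # The slicing below assumes names of the form <4-char prefix><YYYYMMDD>.<ext>.
    latest = {}  # 'YYYYMM' -> file name with the latest day seen for that month

    for name in files_list:
        ym = name[4:10]   # year and month
        d = name[10:12]   # day of month (zero-padded, so string comparison works)

        if ym not in latest or d > latest[ym][10:12]:
            latest[ym] = name

    files_to_open = [os.path.join(path, name) for name in latest.values()]

    # `spark` is an existing SparkSession; load() accepts a list of paths.
    df = spark.read.format(form).option("header", "true").load(files_to_open)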
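
The fixed slice positions above only work if every file name has the same layout. As a rough sketch of an alternative, the selection step could parse the date with a regular expression instead. The pattern (an 8-digit YYYYMMDD stamp right before `.csv`), the function name `latest_per_month`, and the sample file names are all assumptions made for illustration, since the actual names are only visible in the pictures.

```python
import re
from datetime import datetime

# Hypothetical pattern: any prefix, an 8-digit YYYYMMDD date, then ".csv".
DATE_RE = re.compile(r"(\d{8})\.csv$")


def latest_per_month(file_names):
    """Return one file name per (year, month): the one with the latest date."""
    latest = {}  # (year, month) -> (date, file name)
    for name in file_names:
        match = DATE_RE.search(name)
        if match is None:
            continue  # skip names without a date stamp
        try:
            date = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # skip digit runs that are not valid dates
        key = (date.year, date.month)
        if key not in latest or date > latest[key][0]:
            latest[key] = (date, name)
    # Sort by date so the result is in chronological order.
    return [name for _, name in sorted(latest.values())]


# Made-up file names, for illustration only.
sample = ["abcd20230105.csv", "abcd20230128.csv", "abcd20230211.csv", "notes.txt"]
print(latest_per_month(sample))
```

The returned names can then be joined with the directory path and passed as a list to `spark.read.format(...).load(...)`, as in the answer above.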