Search code examples
pythonpandasdataframeglob

Importing files in a specific order using glob


I have a long timeserie where each year is archived within a folder that corresponds to this year. But, in each folder, data is not recorded in a single file, instead, in monthly files.

e.g. 1954 > APR, AUG, DEC... SEP

When I import these files using Glob and create a Dataframe with Pandas they're imported in this same order (above). But, instead, I need a correct months sequence (JAN, FEB, MAR..) for plotting and working with'em. So, my question is:

There is any way to force Glob importing files in a specific order or, maybe, rearrange files using Pandas?

    path = r'path'
    allFiles = glob.glob(path+"/*.dtf")

    df = pd.DataFrame()
    list_ = []
    for file_ in allFiles:
      df = pd.read_csv(file_,header = None,sep=r"\s*")
      list_.append(df)
    df = pd.concat(list_)

Thanks.


Solution

  • You can use concat with parameter keys with filenames:

    Data for testing are here.

    path = r'path-dtfs'
    #add /* for read subfolders
    allFiles = glob.glob(path+"/*/*.dtf")
    print (allFiles)
    ['path\\1954\\FEB.dtf', 'path\\1954\\JAN.dtf', 'path\\1955\\APR.dtf', 'path\\1955\\MAR.dtf']
    
    list_ = []
    for file_ in allFiles:
        df = pd.read_csv(file_,header = None,sep=r"\s+")
        list_.append(df)
    

    Then create new columns by split and insert. For correct sorting need ordered categorical with sort_values:

    df = pd.concat(list_, keys=allFiles)
           .reset_index(level=1, drop=True)
           .rename_axis('years').reset_index()
    
    s = df['years'].str.split('\\')
    df['years'] = s.str[-2].astype(int)
    df.insert(1, 'months', s.str[-1].str.replace('.dtf', ''))
    
    #add all missing months
    cats = ['JAN','FEB','MAR','APR']
    df['months'] = df['months'].astype('category', categories=cats, ordered=True)
    df = df.sort_values(['years','months']).reset_index(drop=True)
    print (df)
       years months  0  1  2
    0   1954    JAN  0  1  2
    1   1954    JAN  1  5  8
    2   1954    FEB  0  9  6
    3   1954    FEB  1  6  4
    4   1955    MAR  5  6  8
    5   1955    MAR  4  7  9
    6   1955    APR  0  3  6
    7   1955    APR  1  4  1
    

    Another solution is create datetime columns by str.extract with to_datetime:

    df = pd.concat(list_, keys=allFiles)
           .reset_index(level=1, drop=True)
           .rename_axis('dates')
           .reset_index()
    df['dates'] = df['dates'].str.extract('path\\\(.*).dtf', expand=False)
    df['dates'] = pd.to_datetime(df['dates'], format='%Y\%b')
    df = df.sort_values('dates').reset_index(drop=True)
    print (df)
           dates  0  1  2
    0 1954-01-01  0  1  2
    1 1954-01-01  1  5  8
    2 1954-02-01  0  9  6
    3 1954-02-01  1  6  4
    4 1955-03-01  5  6  8
    5 1955-03-01  4  7  9
    6 1955-04-01  0  3  6
    7 1955-04-01  1  4  1
    

    Similar solution is use month period by to_period:

    df = pd.concat(list_, keys=allFiles)
           .reset_index(level=1, drop=True)
           .rename_axis('periods').reset_index()
    df['periods'] = df['periods'].str.extract('path\\\(.*).dtf', expand=False)
    df['periods'] = pd.to_datetime(df['periods'], format='%Y\%b').dt.to_period('M')
    df = df.sort_values('periods').reset_index(drop=True)
    
    print (df)
      periods  0  1  2
    0 1954-01  0  1  2
    1 1954-01  1  5  8
    2 1954-02  0  9  6
    3 1954-02  1  6  4
    4 1955-03  5  6  8
    5 1955-03  4  7  9
    6 1955-04  0  3  6
    7 1955-04  1  4  1