Search code examples
pythonpandaspathfilenamescsvreader

Locate only the variable numeric data in filename in pandas column and place numbers in new column


I am using this code to pull in two CSVs of a similar naming convention, place their filenames in a "File" column and concatenate the dataframes into one dataframe called NatHrs.

import glob
from pathlib import Path

path = r'C:\Users\ThisUser\Desktop\AC Mbr Analysis'
all_files = glob.glob(path + '\\Natl_hours_YTD_OC_*.csv')

Nat_dfs = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, encoding='windows-1252', header=1 )
    df['File'] = file
    Nat_dfs.append(df)

NatHrs = pd.concat(Nat_dfs)

Now, I want to take the "File" column, which returns a filename object with entries looking like "C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019", extract ONLY the end of the filename--in this case "2018-2019"--and place those characters into a new colum "Program Year", reflecting the entry "2018-2019". I am not having success with manipulating strings or series--should I be using path.replace? I am lost. When I describe the column I'm looking to parse...

NatHrs['File'].describe

...I get this:

Name: File, dtype: object>

Solution

  • This ultimately ended up working. Thank you much for helping me along!

    globbed_files = glob.glob("Natl_hours_YTD_OC_*.csv")
    globbed_files
    
    data = []
    for csv in globbed_files:
        frame = pd.read_csv(csv, encoding='windows-1252', header=1)
        frame['filename'] = os.path.basename(csv)
        file = os.path.basename(csv)
    #create a new column to store the portion of the file name that denotes the Program Year to which the data belongs
        frame['Program Year'] = re.search('\d+[-]*\d*',file).group()
        data.append(frame)
    
    NatHrs = pd.concat(data, ignore_index=True) #dont want pandas to try an align row indexes
    NatHrs.copy().head()