python generator pathlib

Python generator parsing one file at a time


I often have a folder with a bunch of CSV, Excel, HTML, etc. files. I get tired of always writing a loop that iterates over the files in a folder and opens each one with the appropriate library, so I was hoping I could build a generator that yields one file at a time, already opened with the appropriate library. Here's what I had been hoping to do:

def __get_filename__(file):
    lst = str(file).split('\\')[-1].split('/')[-1].split('.')
    filename, filetype = lst[-2], lst[-1]
    return filename, filetype

def file_iterator(file_path, parser=None, sep=None, encoding='utf8'):
    import pathlib as pl
    if parser == 'BeautifulSoup':
        from bs4 import BeautifulSoup
    elif parser == 'pandas':
        import pandas as pd

    for file in pl.Path(file_path):
        if file.is_file():
            filename, filetype = __get_filename__(file)
            if filetype == 'csv' and parser == 'pandas':
                yield pd.read_csv(file, sep=sep)
            elif filetype == 'excel' and parser == 'pandas':
                yield pd.read_excel(file, engine='openpyxl')
            elif filetype == 'xml' and parser == 'BeautifulSoup':
                with open(file, encoding=encoding, errors='ignore') as xml:
                    yield BeautifulSoup(xml, 'lxml')
            elif parser == None:
                print(filename, filetype)
                yield file

But my hopes and dreams are crushed :P When I do this:

for file in file_iterator(r'C:\Users\hwx756\Desktop\tmp/'):
    print(file)

this throws the error TypeError: 'WindowsPath' object is not iterable

I am sure there must be a way to do this somehow and I'm hoping that someone out there much smarter than me knows :) thanks!


Solution

  • Here is what I think you should do. First, get the names of all the files in your folder like this:

    from os import listdir
    from os.path import isfile, join
    onlyfiles = [f for f in listdir(folder_path) if isfile(join(folder_path, f))]
    

    Then turn each name into an absolute path (for example with join(folder_path, f)) and pass that absolute path to pandas to read the file.

    Also, that code has a typo:

            yield pd.read_excel(path, engine='openpyxl')

    There is no variable named path in that function; it should be file.
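
For what it's worth, the TypeError in the question comes from looping over the Path object itself; a Path is not iterable, but its iterdir() method returns the entries of the folder. Below is a minimal sketch of the question's generator rewritten that way, not a definitive implementation. A few things in it are my assumptions rather than anything from the question: the default sep="," (the question used sep=None), matching Excel files by the suffixes xlsx/xlsm (the question compared against 'excel', which is never a file suffix), also sending html files to BeautifulSoup, and sorting the entries so the order is predictable.

```python
from pathlib import Path


def file_iterator(folder, parser=None, sep=",", encoding="utf8"):
    """Yield each regular file in `folder`, parsed when a parser is given."""
    # Import the third-party libraries only when they are actually requested.
    if parser == "pandas":
        import pandas as pd
    elif parser == "BeautifulSoup":
        from bs4 import BeautifulSoup

    # A Path object is not iterable; iterdir() lists the folder's entries.
    # sorted() is an assumption of this sketch, to get a stable order.
    for file in sorted(Path(folder).iterdir()):
        if not file.is_file():
            continue  # skip sub-folders
        filetype = file.suffix.lower().lstrip(".")  # 'data.CSV' -> 'csv'
        if parser is None:
            yield file
        elif parser == "pandas" and filetype == "csv":
            yield pd.read_csv(file, sep=sep)
        elif parser == "pandas" and filetype in ("xlsx", "xlsm"):
            yield pd.read_excel(file, engine="openpyxl")
        elif parser == "BeautifulSoup" and filetype in ("xml", "html"):
            with open(file, encoding=encoding, errors="ignore") as handle:
                yield BeautifulSoup(handle, "lxml")
```

With parser=None this yields plain Path objects, so the helper __get_filename__ is no longer needed: file.stem gives the name without the extension and file.suffix gives the extension. Files whose type does not match the chosen parser are silently skipped, which mirrors what the question's code did.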