python scikit-learn nltk python-os countvectorizer

Changing a function that reads .txt files into one string so it returns a list of documents


I have a bunch of .txt files in a folder. Here are two functions that I use to read these files and save their contents into a variable as one string:

import glob

s = glob.glob("/Users/user/documents/folder/*.txt")

def read_files(files):
    for filename in files:
        with open(filename, 'r', encoding='latin-1') as file:
            yield file.read()

def read_files_as_string(files, separator='\n'):
    files_content = list(read_files(files=files))
    return separator.join(files_content)

results = read_files_as_string(s)

Now I want to use sklearn's CountVectorizer() to get n-grams from the text. But CountVectorizer() does not accept a single string as input. So my question is: how can I change the reading function so that, instead of joining the files into one string, it stores them as a list with one entry per file, following this logic: ['text1.txt', 'text2.txt', ..., 'textn.txt']

Thanks in advance!


Solution

  • read_files already does almost all of what you want. You can call it directly and use list to convert it from a generator into a regular list:

    results = list(read_files(s))
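
To check that this gives one list entry per file, here is a minimal, self-contained sketch. The temporary folder, the file names and the sample texts are made up for illustration; read_files is the function from the question. sorted() is an addition of mine, since glob does not promise any particular order.

```python
import glob
import os
import tempfile


def read_files(files):
    # Yield the text of each file as a separate string (one document per file).
    for filename in files:
        with open(filename, 'r', encoding='latin-1') as file:
            yield file.read()


# Hypothetical sample data: two small files in a temporary folder.
samples = [('text1.txt', 'first document'), ('text2.txt', 'second document')]

with tempfile.TemporaryDirectory() as folder:
    for name, text in samples:
        with open(os.path.join(folder, name), 'w', encoding='latin-1') as f:
            f.write(text)

    # Sort the paths so the document order is deterministic.
    paths = sorted(glob.glob(os.path.join(folder, '*.txt')))

    # Consume the generator while the files still exist.
    results = list(read_files(paths))

print(results)  # ['first document', 'second document']
```

A list like this is the kind of input CountVectorizer works with, so something along the lines of CountVectorizer(ngram_range=(1, 2)).fit_transform(results) should then give you the n-gram counts per file.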