I have a bunch of .txt files in a folder. Here are two functions I use to read these files and save them into a variable as one string:
import glob

s = glob.glob("/Users/user/documents/folder/*.txt")

def read_files(files):
    # Yield the contents of each file, one string per file
    for filename in files:
        with open(filename, 'r', encoding='latin-1') as file:
            yield file.read()

def read_files_as_string(files, separator='\n'):
    # Join the contents of all files into a single string
    files_content = list(read_files(files=files))
    return separator.join(files_content)

results = read_files_as_string(s)
Now my idea is to use sklearn's CountVectorizer() to get n-grams from the text. But CountVectorizer() does not accept a single string as input. So my question is: how can I change the file-reading function so that, instead of storing everything in one string, it stores the files using this logic: ['text1.txt', 'text2.txt', ..., 'textn.txt']
Thanks in advance!
read_files already does almost all of what you want. You can call it directly and use list to convert it from a generator into a regular list:
results = list(read_files(s))
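To see what that gives you, here is a minimal self-contained sketch. The temporary folder and the two sample files (text1.txt, text2.txt and their contents) are made up for illustration and stand in for your real folder; read_files is the function from the question.

```python
import glob
import os
import tempfile


def read_files(files):
    # Yield the full text of each file, one string per file.
    for filename in files:
        with open(filename, 'r', encoding='latin-1') as file:
            yield file.read()


# Hypothetical sample data: two small files in a temporary folder.
folder = tempfile.mkdtemp()
samples = [('text1.txt', 'first document'), ('text2.txt', 'second document')]
for name, text in samples:
    with open(os.path.join(folder, name), 'w', encoding='latin-1') as f:
        f.write(text)

# sorted() makes the order deterministic; glob does not guarantee one.
s = sorted(glob.glob(os.path.join(folder, '*.txt')))

# One list element per file, instead of one joined string.
results = list(read_files(s))
print(results)
```

Each element of results is the text of one file, so the list can be passed as the documents argument, e.g. CountVectorizer().fit_transform(results). If you would rather hand over the file names themselves, CountVectorizer also has an input='filename' option (together with encoding='latin-1'), in which case you would pass s directly; I have not tried that against your data, so treat it as a suggestion.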