I am writing some scripts to process text files in Python. Locally, the script reads from a single .txt file, so I use
    index_file = open('index.txt', 'r')
    for line in index_file:
        ...
and loop through the file to find a matching string. When running on Amazon EMR, however, index.txt is split into multiple .txt files in a single folder.
I would like to replicate that locally and search multiple .txt files for a certain string, but I am struggling to find clean code to do that.
What is the best way to go about it while writing minimal code?
    import os
    from glob import glob

    def readindex(path):
        # match every .txt file in the given directory
        pattern = '*.txt'
        full_path = os.path.join(path, pattern)
        # sort so files are read in a deterministic order
        for fname in sorted(glob(full_path)):
            with open(fname, 'r') as f:
                for line in f:
                    yield line

    # read the lines into a list so they can be iterated over multiple times
    linelist = list(readindex("directory"))
    for line in linelist:
        print(line, end='')
This script defines a generator (see this question for details about generators) that iterates, in sorted order, through all the files in the directory "directory" that have the extension .txt. It yields all the lines as one stream, which, after calling the function, can be iterated over as if the lines came from a single open file, which seems to be what the question author wants. The end='' argument to print() ensures that the newline is not printed twice, although the body of the for loop would be replaced by the question author anyway. In that case one can use line.rstrip() to strip the trailing newline.
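For the original use case of finding a matching string, a minimal sketch built on readindex (the search term and directory name here are placeholders):

    # Hypothetical example: collect every line containing a search term
    target = 'some string'  # placeholder; substitute the string being searched for
    matches = [line.rstrip('\n') for line in readindex('directory') if target in line]
    for match in matches:
        print(match)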
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order; that is why the code above wraps the glob() call in sorted().
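As an aside, the standard library's fileinput module can chain several files into a single line stream with even less code; a minimal sketch, assuming the same directory layout as above:

    import fileinput
    from glob import glob

    # fileinput.input() accepts a list of filenames and yields their lines in order
    for line in fileinput.input(sorted(glob('directory/*.txt'))):
        print(line, end='')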