I have a bunch of files in a directory named with nameid_cityid.txt, nameid and cityid being the ids of name (integer(10)) and city (integer(10)) in mydata table.
While the following solution works, I am doing type conversions since fetchall fetches 'L' and the file name tuple of nameid, cityid are strings,..
If you can suggest a pythonic or more elegant way of doing the same, that will be awesome, for me and the communtiy!
What I am trying to achieve : Find those files from a directory that don't have a record in the database and then do something with that file, like parse/move/delete it.
MySQL table mydata :
nameid cityid
15633 45632
2354 76894
Python :
for pdffile in os.listdir(filepath):
cityid, nameid = pdffile.strip('.txt').split('_')[0], pdffile.strip('.txt').split('_')[1]
cursor.execute("select cityid, nameid from mydata")
alreadyparsed = cursor.fetchall()
targetvalues = ((str(cityid), str(nameid)) for cityid, nameid in alreadyparsed)
if (int(cityid), int(nameid)) in alreadyparsed:
print cityid, nameid, "Found"
else:
print cityid, nameid, "Not found"
I'd use a set for quick and easy testing:
cursor.execute("select CONCAT(nameid, '_', cityid, '.txt') from mydata")
present = set([r[0] for r in cursor])
for pdffile in os.listdir(filepath):
nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))
print nameid, cityid,
print "Found" if pdffile in present else "Not found"
First, I've pulled the query outside of the filename loop; no point in querying the same set of rows each time.
Secondly, I'll let MySQL generate filenames for me using CONCAT
for ease of collecting the information into a set.
Thirdly, because we now have a set of filenames, testing each individual filename against the set is a simple pdffile in present
test.
And finally, I've simplified your filename splitting logic to one line.
Now, if all you want is a set of filenames that are not present yet in the database (rather than enumerate which ones are and which ones are not), just use a set operation:
cursor.execute("select CONCAT(nameid, '_', cityid, '.txt') from mydata")
present = set([r[0] for r in cursor])
for pdffile in (set(os.listdir(filepath)) - present):
nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))
print nameid, cityid, "Found"
Here we use the .difference
operation (with the -
operator) to remove all the filenames for which there are already rows in the database, in one simple operation.