Search code examples
mysqlcursorpythonfetchall

How to do this lookup more elegantly in Python and MySQL?


I have a bunch of files in a directory named with nameid_cityid.txt, nameid and cityid being the ids of name (integer(10)) and city (integer(10)) in mydata table.

While the following solution works, I am doing type conversions since fetchall fetches 'L' and the file name tuple of nameid, cityid are strings,..

If you can suggest a pythonic or more elegant way of doing the same, that will be awesome, for me and the communtiy!

What I am trying to achieve : Find those files from a directory that don't have a record in the database and then do something with that file, like parse/move/delete it.

MySQL table mydata :

nameid  cityid
15633   45632
2354    76894

Python :

for pdffile in os.listdir(filepath):
    cityid, nameid = pdffile.strip('.txt').split('_')[0], pdffile.strip('.txt').split('_')[1]    
    cursor.execute("select cityid, nameid from mydata")
    alreadyparsed = cursor.fetchall()
    targetvalues = ((str(cityid), str(nameid)) for cityid, nameid in alreadyparsed)
    if (int(cityid), int(nameid)) in alreadyparsed:
        print cityid, nameid, "Found"
    else:
        print cityid, nameid, "Not found"

Solution

  • I'd use a set for quick and easy testing:

    cursor.execute("select CONCAT(nameid, '_', cityid, '.txt') from mydata")
    present = set([r[0] for r in cursor])
    
    for pdffile in os.listdir(filepath):
        nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))
        print nameid, cityid,
        print "Found" if pdffile in present else "Not found"
    

    First, I've pulled the query outside of the filename loop; no point in querying the same set of rows each time.

    Secondly, I'll let MySQL generate filenames for me using CONCAT for ease of collecting the information into a set.

    Thirdly, because we now have a set of filenames, testing each individual filename against the set is a simple pdffile in present test.

    And finally, I've simplified your filename splitting logic to one line.

    Now, if all you want is a set of filenames that are not present yet in the database (rather than enumerate which ones are and which ones are not), just use a set operation:

    cursor.execute("select CONCAT(nameid, '_', cityid, '.txt') from mydata")
    present = set([r[0] for r in cursor])
    
    for pdffile in (set(os.listdir(filepath)) - present):
        nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))
        print nameid, cityid, "Found"
    

    Here we use the .difference operation (with the - operator) to remove all the filenames for which there are already rows in the database, in one simple operation.