Tags: python, performance, csv, memory-efficient

Accessing a particular record in a CSV file efficiently - Python


I have a huge CSV file and I am reading it using the Python csv library's DictReader. Each record has a serial number and some associated information. In my application I take a list of serial numbers provided by the user and check whether they are present in the CSV file or not. First Implementation:

import csv

reader = csv.DictReader(open('sample.csv', 'rb'))
arr = [1000, 7777, 3434, 2121, 9999]
for row in reader:
    if row['Id'] in arr:
        print row['Title']

But this takes too long, as my CSV file contains over 100,000 entries.

Second Implementation:

import csv

reader = csv.DictReader(open('sample.csv', 'rb'))
arr = [1000, 7777, 3434, 2121, 9999]
arr.sort()
i = 0
for row in reader:
    if row['Id'] == arr[i]:
        print row['Title']
        i = i + 1

But this gives ambiguous results, i.e. sometimes it prints the Title only for the first two or three serial numbers in arr.

I want a more efficient approach, a kind of direct hit on a particular serial number. Is that possible?

Please don't suggest linecache or anything else line-based, because my Title is spread over multiple lines, so one CSV record is not equal to one line in the file.


Solution

  • You are attempting to read a 100,000-line text file to find a tiny number of matches.
    I would seriously consider preprocessing that CSV file into an sqlite3 database before doing these lookups.
    I doubt that the CSV file is supplied afresh each time a user requests a few lookups, so that should be possible.
    Of course this depends on how often the CSV file is updated, but I would bet that it isn't often. A single preprocessing of the CSV into an sqlite database, reused for many lookups, would pay dividends; a concrete sketch of that step is given below.

    When the only tool you have is a hammer, everything looks like a nail!

    Edit: The other thing to consider is this: you think you have issues now, but what happens when the CSV file grows to 2 or 3 lakh entries? At some point you are going to have to bite the bullet and either have the CSV file delivered in some structured format or structure it yourself.
    There is also the issue of what the CSV file contains. Currently you have no assurance that it is not full of duplicates, which could seriously mess up your processing. If you apply a structure to the data, not only would searching be vastly quicker, you would also be assured of clean data at the same time.
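
    To make the preprocessing concrete, here is a minimal sketch (Python 3) that loads sample.csv into an sqlite3 table keyed on Id. The table name Records, the file name lookup.db and the build_db helper are illustrative, but the Id and Title columns come straight from the question; declaring Id as the primary key gives indexed lookups and, together with INSERT OR IGNORE, silently drops duplicate serial numbers.

    import csv
    import sqlite3

    def build_db(csv_path="sample.csv", db_path="lookup.db"):
        db = sqlite3.connect(db_path)
        # Id as INTEGER PRIMARY KEY is indexed, so later lookups are direct hits,
        # and the constraint rejects duplicate serial numbers.
        db.execute("CREATE TABLE IF NOT EXISTS Records (Id INTEGER PRIMARY KEY, Title TEXT)")
        with open(csv_path, newline="") as f:
            reader = csv.DictReader(f)  # copes with Title fields that span several lines
            db.executemany(
                "INSERT OR IGNORE INTO Records (Id, Title) VALUES (?, ?)",
                ((row["Id"], row["Title"]) for row in reader))
        db.commit()
        db.close()

    Rebuilding the database is only needed when a fresh CSV file arrives, so this cost is paid once per file rather than once per lookup.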

    Edit 2:

    Here is a tiny Python script to create a database with 2 lakh (200,000) records.
    Clearly in your case you will have to read the CSV file and populate more fields, but this simple test takes just 4.5 seconds on an old 64-bit PC.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import os, sqlite3

    db_name = "2lakh.db"
    try:
        os.remove(db_name)  # start from a fresh database file
    except OSError:
        pass
    db = sqlite3.connect(db_name)
    cursor = db.cursor()
    cursor.execute('CREATE TABLE IF NOT EXISTS Big (Big_id INTEGER NOT NULL PRIMARY KEY UNIQUE, Big_data CHAR)')
    cursor.execute('PRAGMA synchronous = 0')  # hands data handling off to the OS
    n = 0
    while n < 200000:
        try:
            db.execute("INSERT INTO Big (Big_id, Big_data) VALUES (?,?)", (n, "This is some data"))
        except sqlite3.Error as e:
            print 'Big Insert Error ' + str(e), 'Error'
        n += 1
        # only report progress and commit the database every 10000 records (speeds things up immensely)
        if (n % 10000) == 0:
            db.commit()
            print n, "records written"
    db.commit()
    db.close()

    If you only perform the db.commit() every 100,000 transactions, the time taken to create the entire database drops to under 3 seconds. I hope that this helps.
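
    For the lookup side, once the data is in sqlite each serial number becomes an indexed primary-key search instead of a scan of the whole file. Here is a hedged sketch against the Big table created by the script above; the lookup helper name is mine, and the IN clause is assembled purely from ? placeholders so the user-supplied values are still bound as parameters.

    import sqlite3

    def lookup(serials, db_path="2lakh.db"):
        db = sqlite3.connect(db_path)
        # one ? placeholder per requested serial number
        placeholders = ",".join("?" for _ in serials)
        rows = db.execute(
            "SELECT Big_id, Big_data FROM Big WHERE Big_id IN (%s)" % placeholders,
            serials).fetchall()
        db.close()
        return rows

    print(lookup([1000, 7777, 3434, 2121, 9999]))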