Search code examples
pythonstringcsvsearchnested-loops

Python - Search strings in one CSV file with nested for-loop


I'm new to stackoverflow and learning with python.

I'm having some troubles with searching if several strings (taken from an input file) are in one CSV file using Python.

Basically, my python code takes strings one after the other from an input file (inputfile.csv), and searches if each string is in the first column of another file called mainfile.csv. It only compares with the first column of the mainfile.csv, which contains the relevant data I'm looking for.

Note : Files are quite big, over 1 million rows (and growing) for mainfile.csv and usually around 30,000 rows for inputfile.csv.

Here is the code.

#!/usr/bin/python
import csv

mainfile = open('mainfile.csv', 'rb')
inputfile = open('inputfile.csv', 'rb')

mfreader = csv.reader(mainfile, delimiter=',') # mainfile reader
ifreader = csv.reader(inputfile) # inputfile reader, just one column, no delimeter

for ifrow in ifreader:
    for mfrow in mfreader:
        if ifrow[0] == mfrow[0]:
            print ifrow[0], mfrow[0] # This line is a print for debugging purpose
            print "Found a match for : %s " % ifrow[0]
            perform_some_operations()
        else:
            print ifrow[0], mfrow[0] # This line is a print for debugging purpose
            continue

mainfile.close()
inputfile.close()

Problem : The "nested for-loop" only steps in first row of inpufile. It "ignores" the other rows of the inputfile.csv.

EDIT

In fact my comprehension of the problem was wrong. The first for-loop do steps through all the rows of the inputfile. This is the second nested for-loop which only goes once through the iteration process. And since it reaches the end, it doesn't perform any more iteration when the first for-loop iterates.

Example files

Below are some example files. For the purpose of the example the "rows values" are simplified.

Basically, we deal with 2 files :

  • MainFile : contains a list of product informations (SerialNumber, ModelNumber, Text Infos)

  • InputFile : contains a list of SerialNumbers that I'm trying to find in the MainFile

MainFile (mainfile.csv, file size : > 1,000,000 (1M) rows)

Types : SerialNumber[varchar(64)], ModelNumber[(varchar(64)], Informations[varchar(2048)]

SerialNumber, ModelNumber, Informations
SN111aaa, MN123425, Informations for SN111aaa
SN222bbb, MN123425, Informations for SN222bbb
SN333ccc, MN456789, Informations for SN333ccc
SN444ddd, MN654321, Informations for SN444ddd
SN555eee, MN123425, Informations for SN555eee

InputFile (inputfile.csv, file size : ~30,000 (30K) rows)

Types : SerialNumber[varchar(64)]

SN000xyz
SN111xyz
SN222xyz
SN333xyz
SN444ddd

In the above example, since SN444ddd is the only string that can be found in both inputfile and mainfile, my python code should return me (if we dismiss the debugging lines):

Found a match for SN444ddd

And then I could perform some operations.

But it does not. What I get from the debugging print lines is :

$ ./myprogram.py
SN000xyz SerialNumber
SN000xyz SN111aaa
SN000xyz SN222bbb
SN000xyz SN333ccc
SN000xyz SN444ddd
SN000xyz SN555eee
$

Only the first row of the inputfile is processed.

EDIT WRONG. cf. previous edit.

It also performs comparison with the mainfile.csv header, but that "issue" is not reallly important.

Where did I get it wrong?

Thanks for your help.


Solution

  • The main problem seems to be that ifreader and mfreader are iterators, which means once they have exhausted the list of available items, they will not start over.

    A secondary problem is that your approach is rather inefficient. Instead of going through an iterator again and again in the inner loop, I recommend making a set of serial numbers from inputfile.csv. Sets cannot contain duplicate values and they are very efficient to check for the existence of a value.

    So your code could look like this:

    #!/usr/bin/python
    import csv
    
    def perform_some_operations():
        # ...
        pass
    
    with open('inputfile.csv', 'rb') as inputfile:
        ifreader = csv.reader(inputfile) # inputfile reader, just one column, no delimeter
        serial_numbers = {row[0] for row in ifreader}
    
    with open('mainfile.csv', 'rb') as mainfile:
        mfreader = csv.reader(mainfile, delimiter=',') # mainfile reader
    
        for row in mfreader:
            if row[0] in serial_numbers:
                print "match for    : %s " % row[0]
                perform_some_operations()
            else:
                print "NO MATCH for : %s " % row[0]
    

    Here I have used a set comprehension (the curly braces) to fill the set with values from ifreader. After that it's easy to check the set for specific values with the in operator.


    Note - instead of using 'rb' mode to read your files, you really should use the codings module and specify the file encoding when opening the file.

    import codecs
    
    with codecs.open('inputfile.csv', 'r', encoding='utf8') as inputfile:
        ...
    

    Use the proper encoding argument that matches your source data. In Python 3 the open() function supports the encoding argument natively, in Python 2 the module is there to help.