Search code examples
pythonexcelxlrdxlwtvba

comparing excel data from two separate sheets with python/xlrd


I have two lists of lists that were extracted from two separate excel workbooks. Each element contains two of it's own elements. These lists represent the data found in the first two columns of each excel workbook. for example:

search_terms = [['term1',300],['term2',400],['term3',200]...] #words searched on our website with number of hits for each
item_description = [[900001,'a string with term1'],[900002,'a string with term 2'],[900003,'a string with term 1 and 2']...] #item numbers with matching descriptions

My goal is to compare the strings in search_terms to the strings in item_descriptions, and to compile a list of matching item numbers from item_description for each search term. I then would like to take the top 250 terms and matching item numbers based on the amount of hits they generate.

I generated the two lists from xlrd and i was thinking i'd want to convert to tuples and work to generate a list similar to the following:

results = [['term1',300,900001,900003],['term2',400,900002,900003],['term3',200]] #search term, number of hits, and matching item numbers based on item description

I would then write the item numbers to adjacent columns to the matching term/hit in the mother excel files using xlwt for display/presentation purposes.

I'm green as grass when it comes to working with python, xlrd, and programming in general. I appreciate any input and direction along with sensitivity to my naivete in regards to my approach.


Solution

  • You're on the right track, but I think what you'd want here isntead is a dictionary with the term as the key and a list of values as the value. Which would end up looking something like this:

    {
        'term1': [300, 900001,900003],
        'term2': [400,900002,900003],
        'term3': [200]  # are there numbers missing from this one?
    }
    

    Here's what the code for this might look like:

    import re
    from collections import defaultdict
    
    search_terms = [['term1',300],['term2',400],['term3',200]] #words searched on our website with number of hits for each
    item_description = [[900001,'a string with term1'],[900002,'a string with term2'],[900003,'a string with term1 and term2']]
    
    d = defaultdict(list)
    i = 0
    
    for item in search_terms:
        d[item[0]].append(item[1])
        rgx = re.compile(item[0])
        for info in item_description:
            matches = re.findall(rgx, info[1])
            if matches:
                d[item[0]].append(info[0])
            print matches
    print d
    

    Defaultdict tests if a key already exists in the dictionary, and then adds it in if it isn't. You'd then iterate over the dictionary, putting they key into the first column, then iterating over the list and putting each of those into their own column. Let me know if this doesn't fit the structure of your data and I can try and adapt it.