Search code examples
pythonalgorithmcsvbinary-searchlinear-search

Python Linear Search Better Efficiency


I've got a question regarding Linear Searching in Python. Say I've got the base code of

for l in lines:
  for f in search_data:
     if my_search_function(l[1],[f[0],f[2]]):
        print "Found it!"
        break

in which we want to determine where in search_data exists the value stored in l[1]. Say my_search_function() looks like this:

def my_search_function(search_key, search_values):
   for s in search_values:
      if search_key in s:
         return True
    return False

Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.

for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
    found = False

    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index +1
        negative_index = negative_index -1

However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing speed in half as I'm working with large amounts of CSV and the processing time for one file is > 00:15 which is unacceptable as I'm processing batches of 30+ files. Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!


Solution

  • Ultimately, I was broke down and implemented Binary Search on my multidimensional lists by sorting using the sorted() function with a lambda as a key argument.Here is the first pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were

    def binary_search(master_row, source_data,master_search_index, source_search_index):
        lower_bound = 0
        upper_bound = len(source_data) - 1
        found = False
        while lower_bound <= upper_bound and not found:
            middle_pos = (lower_bound + upper_bound) // 2
    
            if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
    
                if search([source_data[middle_pos][source_search_index]],[master_row[master_search_index]]):
                    return {"result": True, "index": middle_pos}
                    break
                lower_bound = middle_pos + 1
    
            elif source_data[middle_pos][source_search_index] > master_row[master_search_index] :
    
                if search([master_row[master_search_index]],[source_data[middle_pos][source_search_index]]):
                    return {"result": True, "index": middle_pos}
                    break
                upper_bound = middle_pos - 1
            else:
    
                if len(source_data[middle_pos][source_search_index]) > 5:
    
                    return {"result": True, "index": middle_pos}
                else:
                    break
    

    and then where we actually make the Binary Search call

    #where master_copy is the first multidimensional list, data_copy is the second
    #the search columns are the columns we want to search against
    for line in master_copy:
        for m in master_search_columns:
            found = False
            for d in data_search_columns:
                data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
                results = binary_search(line, data_copy,m, d)
                found = results["result"]
                if found:
                    line = update_row(line, data_copy[results["index"]], column_mapping)
                    found_count = found_count +1
                    break
            if found:
                break
    

    Here's the info for sorting a multidimensional list Python Sort Multidimensional Array Based on 2nd Element of Subarray