
Fast data structure in Python for indexing a bunch of images as duplicates


Introduction: I want to replace about 280'000 images of math formulas on the Encyclopedia of Mathematics with their corresponding TEX code. To do this, I have classified all of these images (or rather, their URLs) into a list of 100'000 lists.

Each "sublist" contains strings of urls such that each url in that sublist links to the same image. The list looks something like [["https://www.encyclopediaofmath.org/legacyimages/a/a130/a130010/a1300105.png", "https://www.encyclopediaofmath.org/legacyimages/a/a010/a010080/a01008021.png", ...], ["https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w1300801.png", "https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w130080211.png"], ...].

For each sublist, I have determined (or am still in the process of determining) the corresponding TEX code for one image of that sublist. Since the images within each sublist are identical, this gives me the TEX code for every image URL in the whole list.

Now I want to replace the images in each article (such as this one) with the known TEX code. This means I have to look up the image URLs of each article in this list of sublists.

My question: Do you know of any better data structures than a list of lists for the above task?

Example code:

dups = [[i, i+1] for i in range(100000)]
for i in range(10000):
    for j in range(100000):
        if i in dups[j]:
            print(f"Found number {i} in {j}-th list")
            break

In the above example, dups is a simplified version of my list of lists (using numbers instead of strings). As you can see, the above program takes quite some time to finish. I would like to improve dups so that this kind of lookup can be done faster.

Remark 1: The above code essentially makes 1 + 2 + 3 + ... + n comparisons if dups has a length of n, which is n * (n+1)/2 comparisons in total. Since n = 100'000 in my case, this is already a lot of comparisons.
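Putting a concrete number on Remark 1, the triangular sum for n = 100'000 comes out to roughly five billion comparisons:

```python
n = 100_000
total_comparisons = n * (n + 1) // 2
print(total_comparisons)  # 5000050000, i.e. about 5 billion comparisons
```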

Remark 2: An obvious improvement would be to transform each sublist into a Python set and consider a list of sets. However, most of my sublists contain fewer than 3 images, so I doubt that this would greatly improve the runtime.
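To illustrate Remark 2, here is a minimal sketch of the list-of-sets variant (with the same simplified numeric data as the example above): each inner membership test becomes O(1), but the outer scan over all 100'000 sublists is still linear, which is why it does not help much.

```python
dups = [[i, i + 1] for i in range(100_000)]
dup_sets = [set(sub) for sub in dups]

def find_sublist(x):
    # Still a linear scan over all sublists; only the inner
    # membership test ("x in s") is O(1) now.
    for j, s in enumerate(dup_sets):
        if x in s:
            return j
    return None

print(find_sublist(5))  # 4 -- the number 5 first appears in dups[4] == [4, 5]
```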

Remark 3: Note that I can hardly control the order of the "incoming" images (basically, I have to follow the article structure), and I cannot impose a total order on the list of lists (since I cannot break the sublists apart). I have thus not figured out a way to implement binary search.


Solution

  • While it might introduce some data redundancy, I would propose using a binary search tree. Your list of lists is a good idea for indexing, but it has one significant issue: the runtime.

    Your comparison key for the tree could simply be the alphabetical order of the links ("a" < "z", "aa" < "ab", etc.). That essentially gives you binary search, at the cost of some redundant data. Doing the math: you have 280,000 images, so the average search in a balanced BST takes about log2(280,000) ≈ 18 steps. That the same TEX code appears about three times really does not matter given the improvement in speed; just store it three times. Treat each node as a key-value pair: the key is your link, and the corresponding value is stored with it (which you can take from your list of lists). Alternatively, the value of the pair can be the index of the sublist the link belongs to. My general suggestion would be to ignore your sublists while searching and come back to them once the lookup is done.

    A tree would look something like this:

                     (link, code/index)
                    /                  \
        (link, code/index)      (link, code/index)
             /      \                /      \
           etc.    etc.            etc.    etc.
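As a rough sketch of the BST idea (with hypothetical URLs and TEX codes, and no balancing, so worst-case lookups on sorted input degrade to linear time):

```python
class Node:
    def __init__(self, key, value):
        self.key = key      # image URL
        self.value = value  # TEX code (or sublist index)
        self.left = None
        self.right = None

def insert(root, key, value):
    """Insert a (key, value) pair, comparing keys alphabetically."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    return root

def search(root, key):
    """Return the stored value for key, or None if absent."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

# Hypothetical data: every URL in a sublist maps to the same TEX code.
dups = [["url_a1", "url_a2"], ["url_b1"]]
tex = ["a^2", "b_2"]

root = None
for sub, code in zip(dups, tex):
    for url in sub:
        root = insert(root, url, code)  # the code is stored once per URL

print(search(root, "url_a2"))  # a^2
```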
    

    If you want to, or have to, stick to your sublist idea, then my only suggestion is to build a dictionary from your lists. See here for the time complexity of its operations.
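    A minimal sketch of that dictionary approach, using the same simplified numeric data as the question: build a dict mapping each element to the index of the first sublist containing it, so each lookup is O(1) on average instead of a linear scan.

```python
dups = [[i, i + 1] for i in range(100_000)]

# Map each element to the index of the first sublist containing it.
index = {}
for j, sub in enumerate(dups):
    for item in sub:
        index.setdefault(item, j)  # keep the first sublist index seen

# Same lookups as the nested-loop version, but in O(1) each on average.
for i in range(10_000):
    j = index.get(i)  # None if i occurs in no sublist

print(index[5])  # 4 -- the number 5 first appears in dups[4] == [4, 5]
```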

    Though if possible, I would implement this in a language that has pointers, or at least in such a way that the code for every link is the same object, to save space.
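    In Python you get that sharing for free, since assignment binds references rather than copying: store one string object per TEX code and point every URL of a sublist at it. A small sketch with hypothetical URLs and codes:

```python
# One str object per TEX code, shared by all URLs in the same sublist.
dups = [["url_a1", "url_a2", "url_a3"], ["url_b1"]]
tex_codes = ["\\frac{a}{b}", "x^2"]

url_to_tex = {}
for sub, code in zip(dups, tex_codes):
    for url in sub:
        url_to_tex[url] = code  # every key references the same str object

# All three URLs share one object; no copies of the TEX code are made.
print(url_to_tex["url_a1"] is url_to_tex["url_a3"])  # True
```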