Tags: python, pandas, fuzzywuzzy

Adjusting nested apply() functions for large datasets


I have two dataframes that I'm trying to compare, and am facing a volume issue.

I am comparing each new item description against a 4.5-million-row inventory list and calculating a similarity score. I only need the top x recommendations, and I am realizing my current approach quickly gets overwhelmed by the volume of data and crashes the kernel.

I have not dealt with this data size before, so I am unsure how to adjust my code.

Any advice is greatly appreciated. My current approach is to collect all pairwise scores into a dataframe first (holding_df) and then use groupby to pull the top recommendations, but once this process is scaled to the full size of the data, it crashes.

>>> df.head()
            ITEM_DESC
0          paintbrush
1              mop #2
2          red bucket
3  o-light flashlight
>>> df_inventory.head()
    ITEM_DESC
0       broom
1         mop
2      bucket
3  flashlight
import pandas as pd

from fuzzywuzzy import fuzz


def calculate_similarity(x, y):
    sample_list.append(
        {
            "New Item": x,
            "Inventory Item": y,
            "Similarity": fuzz.ratio(str(x).lower(), str(y).lower()),
        }
    )
    return


sample_list = []

df = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)

df_inventory = pd.DataFrame({"ITEM_DESC": ["broom", "mop", "bucket", "flashlight"]})

temp = df["ITEM_DESC"].apply(
    lambda x: df_inventory["ITEM_DESC"].apply(lambda y: calculate_similarity(x, y))
)

holding_df = pd.DataFrame(sample_list)
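For reference, the groupby step I mentioned looks roughly like this (a toy holding_df stands in for the full one, and the similarity numbers are made up):

```python
import pandas as pd

# Toy stand-in for the real holding_df built from sample_list
holding_df = pd.DataFrame(
    [
        {"New Item": "mop #2", "Inventory Item": "mop", "Similarity": 67},
        {"New Item": "mop #2", "Inventory Item": "broom", "Similarity": 36},
        {"New Item": "mop #2", "Inventory Item": "bucket", "Similarity": 17},
    ]
)

# Keep only the top 2 matches per new item
top = (
    holding_df.sort_values("Similarity", ascending=False)
    .groupby("New Item")
    .head(2)
)
print(top)
```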

Solution

  • I implemented something in plain Python that won't break your kernel, but it won't be super fast.

    It takes about 6-7 seconds to compare a single new product against the whole inventory. That will probably be too slow for 3.5k new items (about 6h 20min if I ran it on my machine). With some work, it can be parallelized, though.

    6.5 s per new item
    3500 items * 6.5 s = 22,750 s ≈ 6 h 19 min
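    That estimate works out as follows:

```python
# Serial runtime estimate: 3500 new items at ~6.5 s each
seconds_per_item = 6.5
num_new_items = 3500
total_seconds = num_new_items * seconds_per_item  # 22750 s
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60
print(f"{int(hours)}h {int(minutes)}min")  # -> 6h 19min
```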
    

    The main memory-saver is the FixedSizeLeaderboard class that I implemented to keep track of the top n most similar items for each new product. As the task is now CPU-bound rather than memory-bound, you can benefit from rewriting it a bit to use the multiprocessing module.
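    A minimal sketch of that multiprocessing rewrite could look like the following. It fans the new items out across worker processes and keeps the top n per item with heapq.nlargest; difflib.SequenceMatcher stands in for fuzz.ratio here so the sketch runs without fuzzywuzzy (swap it back for real use), and the tiny INVENTORY list stands in for the 4.5M-row one:

```python
import difflib
import heapq
from multiprocessing import Pool

# Tiny stand-in inventory; in practice this is the 4.5M-row list
INVENTORY = ["broom", "mop", "bucket", "flashlight"]


def ratio(a: str, b: str) -> int:
    # Stdlib stand-in for fuzz.ratio (0-100 scale)
    return round(difflib.SequenceMatcher(None, a, b).ratio() * 100)


def top_matches(new_item: str, n: int = 3):
    # Score the new item against every inventory row, keep only the best n
    scored = ((ratio(new_item.lower(), inv.lower()), inv) for inv in INVENTORY)
    return new_item, heapq.nlargest(n, scored)


if __name__ == "__main__":
    new_items = ["mop #2", "red bucket"]
    with Pool(processes=2) as pool:
        for item, best in pool.map(top_matches, new_items):
            print(item, best)
```

    On platforms that fork, the workers inherit INVENTORY for free; under spawn (e.g. Windows) each worker re-imports the module, so keep the inventory at module level.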

    I decided to just generate some test data that may or may not represent actual performance. I added a few comments where you'd plug in your data.

    import bisect
    import collections
    import contextlib
    import itertools
    import time
    import typing
    import uuid
    
    from fuzzywuzzy import fuzz
    
    
    @contextlib.contextmanager
    def log_runtime(task: str):
        """Contextmanager that logs the runtime of a piece of code."""
    
        start = time.perf_counter()
    
        yield
    
        runtime = time.perf_counter() - start
    
        print("Task '%s' took %.4f seconds" % (task, runtime))
    
    
    def inventory_generator() -> typing.Iterable[str]:
        """Returns an iterable that yields product names."""
    
        def string_generator() -> typing.Iterable[str]:
            while True:
                yield str(uuid.uuid4())
                yield from ("aaa", "aba", "def", "dse", "asd")
    
        yield from string_generator()
    
    
    class FixedSizeLeaderboard:
    
        size: int
        _min_score: typing.Optional[int]
        _items: typing.List[typing.Tuple[int, object]]
    
        def __init__(self, size) -> None:
            self.size = size
            self._items = []
            self._min_score = None
    
        def add(self, score: int, item: object) -> None:
    
            if len(self._items) < self.size or score > self._min_score:
                self._eject_element_with_lowest_score()
                bisect.insort(self._items, (score, item))
                self._min_score = self._items[0][0]
    
        def _eject_element_with_lowest_score(self) -> None:
            if len(self._items) == self.size:
                # The list is sorted, so we can pop the first one
                self._items.pop(0)
    
        def get(self) -> typing.List[typing.Tuple[int, object]]:
            return sorted(self._items, reverse=True)
    
    
    def main():
    
        num_new_products = 2
        num_products_in_inventory = 4_500_000
        top_n_similarities = 3
    
        with log_runtime("Generate dummy-products"):
    
            # Convert everything to lowercase once.
            # This is not really required for uuids, but it should happen ONCE
            # Instead of the inventory_generator, you'd pass the content of your dataframe here.
            new_products = list(
                map(str.lower, itertools.islice(inventory_generator(), num_new_products))
            )
            inventoried_products = list(
                map(
                    str.lower,
                    itertools.islice(inventory_generator(), num_products_in_inventory),
                )
            )
    
        task_desc = (
            f"{num_new_products} x {num_products_in_inventory}"
            f" = {num_new_products * num_products_in_inventory} similarity computations"
        )
    
        product_to_leaderboard: typing.Dict[
            str, FixedSizeLeaderboard
        ] = collections.defaultdict(lambda: FixedSizeLeaderboard(top_n_similarities))
    
        with log_runtime(task_desc):
            for new_product, existing_product in itertools.product(
                new_products, inventoried_products
            ):
    
                similarity = fuzz.ratio(new_product, existing_product)
                product_to_leaderboard[new_product].add(similarity, existing_product)
    
        # Sort of pretty output formatting
        for product, similarities in product_to_leaderboard.items():
            print("=" * 3, "New Product", product, "=" * 3)
            for position, (score, inventory_item) in enumerate(similarities.get()):
                print(f"{position + 1:02}. score: {score} product: {inventory_item}")
    
    
    if __name__ == "__main__":
        main()
    

    If we execute it, we get something like this:

    $ python apply_thingy.py
    Task 'Generate dummy-products' took 1.6449 seconds
    Task '2 x 4500000 = 9000000 similarity computations' took 12.0887 seconds
    === New Product 2d10f990-355e-42f6-b518-0a21a7fb8d5c ===
    01. score: 56 product: f2100878-3c3e-4f86-b410-3c362184d195
    02. score: 56 product: 5fc9b30c-35ed-4167-b997-1bf0a2af5b68
    03. score: 56 product: 523210b2-e5e0-496a-b0b1-a1b2af49b0d5
    === New Product aaa ===
    01. score: 100 product: aaa
    02. score: 100 product: aaa
    03. score: 100 product: aaa
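    As a footnote, the stdlib heapq module can stand in for FixedSizeLeaderboard: a min-heap of size n gives the same top-n tracking, with heapq.heappushpop doing the eject-and-insert in one step. A rough equivalent (the TopN name is mine):

```python
import heapq
import typing


class TopN:
    """Keep the n highest-scoring (score, item) pairs seen so far."""

    def __init__(self, n: int) -> None:
        self.n = n
        self._heap: typing.List[typing.Tuple[int, str]] = []

    def add(self, score: int, item: str) -> None:
        entry = (score, item)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # heap[0] is the current minimum; replace it in one O(log n) step
            heapq.heappushpop(self._heap, entry)

    def get(self) -> typing.List[typing.Tuple[int, str]]:
        return sorted(self._heap, reverse=True)


board = TopN(3)
for score, item in [(56, "a"), (90, "b"), (10, "c"), (75, "d")]:
    board.add(score, item)
print(board.get())  # [(90, 'b'), (75, 'd'), (56, 'a')]
```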