Tags: python, pandas, fuzzywuzzy

Adjusting nested apply() functions for large datasets


I have two dataframes that I'm trying to compare, and am facing a volume issue.

I am comparing each new item description against a 4.5-million-row inventory list and calculating a similarity score. I only need the top x recommendations, and I am realizing my current approach quickly gets overwhelmed by the volume of data and crashes the kernel.

I have not dealt with this data size before, so I am unsure how to adjust my code.

Any advice is greatly appreciated. My current approach is to collect all pairwise scores into a dataframe first (holding_df) and then use groupby to pull the top recommendations, but once this process is scaled to the full size of the data, it crashes.

>>> df.head()
            ITEM_DESC
0          paintbrush
1              mop #2
2          red bucket
3  o-light flashlight
>>> df_inventory.head()
    ITEM_DESC
0       broom
1         mop
2      bucket
3  flashlight
import pandas as pd

from fuzzywuzzy import fuzz


def calculate_similarity(x, y):
    sample_list.append(
        {
            "New Item": x,
            "Inventory Item": y,
            "Similarity": fuzz.ratio(str(x).lower(), str(y).lower()),
        }
    )
    return


sample_list = []

df = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)

df_inventory = pd.DataFrame({"ITEM_DESC": ["broom", "mop", "bucket", "flashlight"]})

temp = df["ITEM_DESC"].apply(
    lambda x: df_inventory["ITEM_DESC"].apply(lambda y: calculate_similarity(x, y))
)

holding_df = pd.DataFrame(sample_list)
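For reference, the groupby step I mentioned looks roughly like this (a toy holding_df stands in for the full one, and the similarity numbers are made up):

```python
import pandas as pd

# Toy stand-in for the real holding_df built from sample_list
holding_df = pd.DataFrame(
    [
        {"New Item": "mop #2", "Inventory Item": "mop", "Similarity": 67},
        {"New Item": "mop #2", "Inventory Item": "broom", "Similarity": 36},
        {"New Item": "mop #2", "Inventory Item": "bucket", "Similarity": 17},
    ]
)

# Keep only the top 2 matches per new item
top = (
    holding_df.sort_values("Similarity", ascending=False)
    .groupby("New Item")
    .head(2)
)
print(top)
```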

Solution

  • I implemented something in plain Python that won't break your kernel, but it won't be super fast.

    It takes about 6-7 seconds to compare a single new product against the whole inventory. That will probably be too slow for 3.5k new items (about 6h 20min if I ran it on my machine). With some work, it can be parallelized, though.

    6.5 s per new item
    3500 items * 6.5 s = 22,750 s ≈ 6 h 19 min
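    That estimate works out as follows:

```python
# Serial runtime estimate: 3500 new items at ~6.5 s each
seconds_per_item = 6.5
num_new_items = 3500
total_seconds = num_new_items * seconds_per_item  # 22750 s
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60
print(f"{int(hours)}h {int(minutes)}min")  # -> 6h 19min
```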
    

    The main memory-saver is the FixedSizeLeaderboard class that I implemented to keep track of the top n most similar items for each new product. As the task is now CPU-bound rather than memory-bound, you can benefit from rewriting it a bit to use the multiprocessing module.
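    A minimal sketch of that multiprocessing rewrite could look like the following. It fans the new items out across worker processes and keeps the top n per item with heapq.nlargest; difflib.SequenceMatcher stands in for fuzz.ratio here so the sketch runs without fuzzywuzzy (swap it back for real use), and the tiny INVENTORY list stands in for the 4.5M-row one:

```python
import difflib
import heapq
from multiprocessing import Pool

# Tiny stand-in inventory; in practice this is the 4.5M-row list
INVENTORY = ["broom", "mop", "bucket", "flashlight"]


def ratio(a: str, b: str) -> int:
    # Stdlib stand-in for fuzz.ratio (0-100 scale)
    return round(difflib.SequenceMatcher(None, a, b).ratio() * 100)


def top_matches(new_item: str, n: int = 3):
    # Score the new item against every inventory row, keep only the best n
    scored = ((ratio(new_item.lower(), inv.lower()), inv) for inv in INVENTORY)
    return new_item, heapq.nlargest(n, scored)


if __name__ == "__main__":
    new_items = ["mop #2", "red bucket"]
    with Pool(processes=2) as pool:
        for item, best in pool.map(top_matches, new_items):
            print(item, best)
```

    On platforms that fork, the workers inherit INVENTORY for free; under spawn (e.g. Windows) each worker re-imports the module, so keep the inventory at module level.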

    I decided to just generate some test data that may or may not represent actual performance. I added a few comments where you'd plug in your data.

    import bisect
    import collections
    import contextlib
    import itertools
    import time
    import typing
    import uuid
    
    from fuzzywuzzy import fuzz
    
    
    @contextlib.contextmanager
    def log_runtime(task: str):
        """Contextmanager that logs the runtime of a piece of code."""
    
        start = time.perf_counter()
    
        yield
    
        runtime = time.perf_counter() - start
    
        print("Task '%s' took %.4f seconds" % (task, runtime))
    
    
    def inventory_generator() -> typing.Iterable[str]:
        """Returns an iterable that yields product names."""
    
        def string_generator() -> typing.Iterable[str]:
            while True:
                yield str(uuid.uuid4())
                yield from ("aaa", "aba", "def", "dse", "asd")
    
        yield from string_generator()
    
    
    class FixedSizeLeaderboard:
    
        size: int
        _min_score: typing.Optional[int]
        _items: typing.List[typing.Tuple[int, object]]
    
        def __init__(self, size) -> None:
            self.size = size
            self._items = []
            self._min_score = None
    
        def add(self, score: int, item: object) -> None:
    
            if len(self._items) < self.size or score > self._min_score:
                self._eject_element_with_lowest_score()
                bisect.insort(self._items, (score, item))
                self._min_score = self._items[0][0]
    
        def _eject_element_with_lowest_score(self) -> None:
            if len(self._items) == self.size:
                # The list is sorted, so we can pop the first one
                self._items.pop(0)
    
        def get(self) -> typing.List[typing.Tuple[int, object]]:
            return sorted(self._items, reverse=True)
    
    
    def main():
    
        num_new_products = 2
        num_products_in_inventory = 4_500_000
        top_n_similarities = 3
    
        with log_runtime("Generate dummy-products"):
    
            # Convert everything to lowercase once.
            # This is not really required for uuids, but it should happen ONCE
            # Instead of the inventory_generator, you'd pass the content of your dataframe here.
            new_products = list(
                map(str.lower, itertools.islice(inventory_generator(), num_new_products))
            )
            inventoried_products = list(
                map(
                    str.lower,
                    itertools.islice(inventory_generator(), num_products_in_inventory),
                )
            )
    
        task_desc = (
            f"{num_new_products} x {num_products_in_inventory}"
            f" = {num_new_products * num_products_in_inventory} similarity computations"
        )
    
        product_to_leaderboard: typing.Dict[
            str, FixedSizeLeaderboard
        ] = collections.defaultdict(lambda: FixedSizeLeaderboard(top_n_similarities))
    
        with log_runtime(task_desc):
            for new_product, existing_product in itertools.product(
                new_products, inventoried_products
            ):
    
                similarity = fuzz.ratio(new_product, existing_product)
                product_to_leaderboard[new_product].add(similarity, existing_product)
    
        # Sort of pretty output formatting
        for product, similarities in product_to_leaderboard.items():
            print("=" * 3, "New Product", product, "=" * 3)
            for position, (score, inventory_item) in enumerate(similarities.get()):
                print(f"{position + 1:02}. score: {score} product: {inventory_item}")
    
    
    if __name__ == "__main__":
        main()
    

    If we execute it, we get something like this:

    $ python apply_thingy.py
    Task 'Generate dummy-products' took 1.6449 seconds
    Task '2 x 4500000 = 9000000 similarity computations' took 12.0887 seconds
    === New Product 2d10f990-355e-42f6-b518-0a21a7fb8d5c ===
    01. score: 56 product: f2100878-3c3e-4f86-b410-3c362184d195
    02. score: 56 product: 5fc9b30c-35ed-4167-b997-1bf0a2af5b68
    03. score: 56 product: 523210b2-e5e0-496a-b0b1-a1b2af49b0d5
    === New Product aaa ===
    01. score: 100 product: aaa
    02. score: 100 product: aaa
    03. score: 100 product: aaa
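    As a footnote, the stdlib heapq module can stand in for FixedSizeLeaderboard: a min-heap of size n gives the same top-n tracking, with heapq.heappushpop doing the eject-and-insert in one step. A rough equivalent (the TopN name is mine):

```python
import heapq
import typing


class TopN:
    """Keep the n highest-scoring (score, item) pairs seen so far."""

    def __init__(self, n: int) -> None:
        self.n = n
        self._heap: typing.List[typing.Tuple[int, str]] = []

    def add(self, score: int, item: str) -> None:
        entry = (score, item)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # heap[0] is the current minimum; replace it in one O(log n) step
            heapq.heappushpop(self._heap, entry)

    def get(self) -> typing.List[typing.Tuple[int, str]]:
        return sorted(self._heap, reverse=True)


board = TopN(3)
for score, item in [(56, "a"), (90, "b"), (10, "c"), (75, "d")]:
    board.add(score, item)
print(board.get())  # [(90, 'b'), (75, 'd'), (56, 'a')]
```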