Python - Pandas x Polars - Values mapping (Lookup value)

One way in Pandas to map a Series from one value to another (or to do a 'lookup value') is with the map function.

So If I have two DataFrames (say one dimension and one fact table):

# imports
import numpy as np
import pandas as pd

# Dimension Table - DataFrame to hold the letter and its ID.
letters_ids = pd.DataFrame({
    'Letters': ['A', 'B', 'C'],
    'Letters_id': [1, 2, 3]     
}).set_index('Letters')

# Fact Table - DataFrame with millions of letters.
many_letters = pd.DataFrame({
   'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
})

And considering that the business model require that we add to the fact table the 'Letters_id' we could do:

many_letters['letters_mapped'] = many_letters.Letters.map(letters_ids.squeeze())

That's clean and straightforward.

Now what about Polars? I mean Polars have a map function but doesn't seems to work like Pandas.

I found two ways with Polars to map a Series from one value to another (or to do a 'lookup value') but I'm feeling I'm missing a proper way of doing it.

So say we have the same dataset with Polars:

# imports
import numpy as np
import polars as pl

# Dimension Table - DataFrame to hold the letter and its ID.
letters_ids = pl.DataFrame({
    'Letters': ['A', 'B', 'C'],
    'Letters_id': [1, 2, 3]     
})

# Fact Table - DataFrame with millions of letters.
many_letters = pl.DataFrame({
    'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
})

Now in order to add to the fact table the 'Letters_id':

Approach # 1 - We can use the join method:

many_letters = many_letters.join(
    other=letters_ids,
    on='Letters',
    how='left'
).rename({'Letters_id': 'letters_mapped'})

Approach # 2 - Use the replace (found the idea at this SO question)

# Convert the two columns DataFrame into a Python's dictionary.
letters_dict = dict(letters_ids.iter_rows())

# Maps the dictionary
many_letters = many_letters.with_columns(
    pl.col('Letters').replace(letters_dict).alias('letters_mapped')
)

Just as an information when considering performance, both Polars approaches above are faster to do the job then Pandas:

Pandas mapping execution time (for 10 million rows): average 0.35 seconds
Polars mapping execution time (for 10 million rows): average 0.21 seconds (Approach # 1)
Polars mapping execution time (for 10 million rows): average 0.20 seconds (Approach # 2)

Which of these is preferred or is there a 3rd approach which may be better still?

Solution

Bottom line first, joining the two dfs is the best way BUT replace is a join so between the two of them, just do whichever is easier. Most of the time stick with these 2 approaches.

The only conceivable third option, if you're dealing specifically with strings and indices, is to use the Categorical type instead. Categoricals, at their base level, just stash the unique values away and represent them with an integer (technically a u32) index in data. You can control what integer will be assigned to a value by using the StringCache (see below for links to more details on doing that). You can extract

with pl.StringCache():     
    letters_ids = pl.DataFrame({
        'Letters': ['','A', 'B', 'C'], 
                   ## I added an empty string to the beginning 
                   ## to get 1 indexing since categoricals use 0 indexing
    }, schema={'Letters':pl.Categorical()})
    many_letters = pl.DataFrame({
    'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)}, schema={'Letters':pl.Categorical()
    })

At this point you could do

many_letters.with_columns(letter_id=pl.col('Letters').to_physical())
shape: (10_000_000, 2)
┌─────────┬───────────┐
│ Letters ┆ letter_id │
│ ---     ┆ ---       │
│ cat     ┆ u32       │
╞═════════╪═══════════╡
│ C       ┆ 3         │
│ C       ┆ 3         │
│ A       ┆ 1         │
│ A       ┆ 1         │
│ …       ┆ …         │
│ B       ┆ 2         │
│ B       ┆ 2         │
│ B       ┆ 2         │
│ C       ┆ 3         │
└─────────┴───────────┘

There are some major caveats here. If the id matches a database id and it is important to carry then using categoricals is a little tougher. If the id needs to be anything other than uint then they're out.

This issue report is illustrative of the difficulties of dealing with categoricals.

Here's an SO question with some other tricks to manipulate underlying index