Tags: python, python-polars

Create new column in df based on membership of values from another column in a dictionary


Python 3.12.3, Polars 1.8.2, polars-lts-cpu 1.10.0, OS: Linux-lite 24.04 (VM)

I have the following code:

import polars as pl

countries = ['usa', 'france', 'japan', 'brazil', 'new_zealand']
calling_codes = [1, 33, 81, 55, 64]

df = pl.DataFrame({'country': countries, 'calling_code': calling_codes })

capitals_dict = {'usa':'washington_dc', 'france': 'paris', 'brazil': 'brasilia'}

I would like to create a new column called capital in df, filled with the value from capitals_dict whenever the country in df['country'] is one of the keys of capitals_dict.

I have tried using replace:

df.with_columns(capital = pl.col('country').replace(capitals_dict))
shape: (5, 3)
┌─────────────┬──────────────┬───────────────┐
│ country     ┆ calling_code ┆ capital       │
│ ---         ┆ ---          ┆ ---           │
│ str         ┆ i64          ┆ str           │
╞═════════════╪══════════════╪═══════════════╡
│ usa         ┆ 1            ┆ washington_dc │
│ france      ┆ 33           ┆ paris         │
│ japan       ┆ 81           ┆ japan         │
│ brazil      ┆ 55           ┆ brasilia      │
│ new_zealand ┆ 64           ┆ new_zealand   │
└─────────────┴──────────────┴───────────────┘

But this fills the rows for japan and new_zealand with the country name itself. How can I assign a default value to countries that appear in df (that is, in the countries and calling_codes lists) but not in capitals_dict?

So that I get something like this instead:

shape: (5, 3)
┌─────────────┬──────────────┬───────────────┐
│ country     ┆ calling_code ┆ capital       │
│ ---         ┆ ---          ┆ ---           │
│ str         ┆ i64          ┆ str           │
╞═════════════╪══════════════╪═══════════════╡
│ usa         ┆ 1            ┆ washington_dc │
│ france      ┆ 33           ┆ paris         │
│ japan       ┆ 81           ┆ [default]     │ # <-
│ brazil      ┆ 55           ┆ brasilia      │
│ new_zealand ┆ 64           ┆ [default]     │ # <-
└─────────────┴──────────────┴───────────────┘

Solution

  • Depending on the goal, there are two replace methods:

    If you want to keep the original value for a non-match, you can use replace:

    df.with_columns(capital = pl.col("country").replace(capitals_dict))
    
    shape: (5, 3)
    ┌─────────────┬──────────────┬───────────────┐
    │ country     ┆ calling_code ┆ capital       │
    │ ---         ┆ ---          ┆ ---           │
    │ str         ┆ i64          ┆ str           │
    ╞═════════════╪══════════════╪═══════════════╡
    │ usa         ┆ 1            ┆ washington_dc │
    │ france      ┆ 33           ┆ paris         │
    │ japan       ┆ 81           ┆ japan         │ # non-match unchanged
    │ brazil      ┆ 55           ┆ brasilia      │
    │ new_zealand ┆ 64           ┆ new_zealand   │ # non-match unchanged
    └─────────────┴──────────────┴───────────────┘
    

    If you want to replace with values of a different dtype, or if you want a default value for non-matches, you can use replace_strict:

    df.with_columns(
        pl.col("country")
        .replace_strict(capitals_dict, default="NOT FOUND")
        .alias("capital")
    )
    
    shape: (5, 3)
    ┌─────────────┬──────────────┬───────────────┐
    │ country     ┆ calling_code ┆ capital       │
    │ ---         ┆ ---          ┆ ---           │
    │ str         ┆ i64          ┆ str           │
    ╞═════════════╪══════════════╪═══════════════╡
    │ usa         ┆ 1            ┆ washington_dc │
    │ france      ┆ 33           ┆ paris         │
    │ japan       ┆ 81           ┆ NOT FOUND     │
    │ brazil      ┆ 55           ┆ brasilia      │
    │ new_zealand ┆ 64           ┆ NOT FOUND     │
    └─────────────┴──────────────┴───────────────┘