python-polars

Casting a column from hexadecimal string to uint64?


As part of the Kaggle competition (https://www.kaggle.com/competitions/amex-default-prediction/overview), I'm trying to take advantage of a trick shared by other competitors, where the size of a column is reduced by interpreting a hexadecimal string as a base-16 uint64. I'm trying to work out if this is possible in polars/rust:

# The python approach - this is used via .apply in pandas.
string = "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"
def func(x):
    return int(x[-16:], 16)
func(string)
# 13914591055249847850

My attempt at a solution in polars yields nearly the right answer, but the final digits are off, which is a bit confusing:

import polars as pl
def func(x: str) -> int:
    return int(x[-16:], 16)

strings = [
    "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a",
    "00000fd6641609c6ece5454664794f0340ad84dddce9a267a310b5ae68e9d8e5",
]

df = pl.DataFrame({"id": strings})

result_polars = df.with_columns(pl.col("id").map_elements(func).cast(pl.UInt64)).to_series().to_list()
result_python = [func(x) for x in strings]

result_polars, result_python
# ([13914591055249848320, 11750091188498716672],
#  [13914591055249847850, 11750091188498716901])

I've also tried casting directly from utf8 to uint64, but I get the following error (passing strict=False yields nulls instead):

df.with_columns(pl.col("id").str.slice(-16).cast(pl.UInt64)).to_series().to_list()

# InvalidOperationError: conversion from `str` to `u64` failed 
# in column 'id' for 2 out of 2 values: ["c11a8ed792feb62a", "a310b5ae68e9d8e5"]


Solution

  • The values you return from func are:

    13914591055249847850
    11750091188498716901
    

    These values are larger than can be represented by a pl.Int64, which is what polars uses for Python's int type. If a value overflows, polars falls back to Float64 instead, but this comes with a loss of precision.
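
    For instance, as a minimal sketch (plain Python, no polars needed): the first value exceeds the Int64 maximum, and round-tripping it through a 64-bit float reproduces exactly the "off" value observed above.

    value = 13914591055249847850   # func() applied to the first string
    print(value > 2**63 - 1)       # True -> does not fit in an Int64
    print(int(float(value)))       # 13914591055249848320, the "wrong" result seen above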

    A better solution

    Taking just the last 16 characters of a string throws away a lot of information, meaning you can easily have collisions. It's better to use a hash function that tries to avoid collisions.

    You could use the hash expression. This gives you a better-quality hash, and it will be much faster as you don't run Python code.

    df.with_columns(
        pl.col("id").hash(seed=0)
    )
    
    shape: (2, 1)
    ┌─────────────────────┐
    │ id                  │
    │ ---                 │
    │ u64                 │
    ╞═════════════════════╡
    │ 478697168017298650  │
    │ 7596707240263070258 │
    └─────────────────────┘
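
    As a rough check of the original goal (shrinking the column), you could compare DataFrame.estimated_size() before and after hashing; the exact byte counts depend on your data, but the hashed column stores a single u64 per row instead of a 64-character string:

    df.estimated_size()                                           # size with the hex strings
    df.with_columns(pl.col("id").hash(seed=0)).estimated_size()  # size with one u64 per row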