As part of the kaggle competition (https://www.kaggle.com/competitions/amex-default-prediction/overview), I'm trying to take advantage of a trick where they (other competitors sharing their solution) reduce the size of a column by interpreting a hexadecimal string as a base-16 uint64. I'm trying to work out if this is possible in polars /rust:
# The Python approach - this is used via .apply in pandas.
string = "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"
def func(x):
    return int(x[-16:], 16)
func(string)
# 13914591055249847850
My attempt at a solution in polars yields nearly the right answer, but the final digits are off, which is a bit confusing:
import polars as pl
def func(x: str) -> int:
    return int(x[-16:], 16)
strings = [
    "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a",
    "00000fd6641609c6ece5454664794f0340ad84dddce9a267a310b5ae68e9d8e5",
]
df = pl.DataFrame({"id": strings})
result_polars = df.with_columns(pl.col("id").map_elements(func).cast(pl.UInt64)).to_series().to_list()
result_python = [func(x) for x in strings]
result_polars, result_python
# ([13914591055249848320, 11750091188498716672],
# [13914591055249847850, 11750091188498716901])
I've also tried casting directly from utf-8 to uint64, but I get the following error (or nulls, if I pass strict=False):
df.with_columns(pl.col("id").str.slice(-16).cast(pl.UInt64)).to_series().to_list()
# InvalidOperationError: conversion from `str` to `u64` failed
# in column 'id' for 2 out of 2 values: ["c11a8ed792feb62a", "a310b5ae68e9d8e5"]
The values you return from func are:
13914591055249847850
11750091188498716901
These values are larger than can be represented with a pl.Int64, which is what polars uses for Python's int type. If a value overflows, polars falls back to Float64, which comes with a loss of precision.
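You can reproduce the exact trailing-digit corruption in pure Python: a Float64 has a 53-bit mantissa, so integers this large get rounded to the nearest representable value. A minimal sketch, using the last 16 hex characters of the first id from the question:

```python
# Values above 2**63 - 1 overflow Int64; a Float64 round-trip then rounds
# the integer to the nearest value its 53-bit mantissa can represent.
exact = int("c11a8ed792feb62a", 16)  # last 16 hex chars of the first id
rounded = int(float(exact))         # simulate the Float64 fallback

print(exact)    # 13914591055249847850
print(rounded)  # 13914591055249848320 -- the "off" polars result above
```

This is exactly the discrepancy between result_python and result_polars in the question.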
Taking just the last 16 characters of a string throws away a lot of information, meaning you can easily get collisions. It's better to use a hash function that tries to avoid collisions.
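To make the collision risk concrete (with made-up ids, not competition data): any two ids that share the same final 16 characters map to the same integer, no matter how different the rest of the string is.

```python
def func(x: str) -> int:
    return int(x[-16:], 16)

# Two distinct 64-character ids (hypothetical) with identical last 16 chars.
a = "aaaa" * 12 + "c11a8ed792feb62a"
b = "bbbb" * 12 + "c11a8ed792feb62a"

assert a != b
assert func(a) == func(b)  # collision: only the last 16 chars matter
```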
You could use the hash expression. This gives you a higher-quality hash, and it will be much faster since no Python code is run.
df.with_columns(
pl.col("id").hash(seed=0)
)
shape: (2, 1)
┌─────────────────────┐
│ id │
│ --- │
│ u64 │
╞═════════════════════╡
│ 478697168017298650 │
│ 7596707240263070258 │
└─────────────────────┘