I want to implement a hashed cross product transformation like the one Keras uses:
>>> layer = keras.layers.HashedCrossing(num_bins=5, output_mode='one_hot')
>>> feat1 = np.array([1, 5, 2, 1, 4])
>>> feat2 = np.array([2, 9, 42, 37, 8])
>>> layer((feat1, feat2))
<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0.],
[0., 0., 1., 0., 0.]], dtype=float32)>
>>> layer2 = keras.layers.HashedCrossing(num_bins=5, output_mode='int')
>>> layer2((feat1, feat2))
<tf.Tensor: shape=(5,), dtype=int64, numpy=array([2, 0, 4, 0, 2])>
This layer performs crosses of categorical features using the "hashing trick". Conceptually, the transformation can be thought of as: hash(concatenate(features)) % num_bins.
I'm struggling to understand the concatenate(features) part. Do I have to do the hash of each "pair" of features?
In the meantime, I tried with this:
>>> cross_product_idx = (feat1 * (feat2.max() + 1) + feat2) % num_bins
>>> cross_product = nn.functional.one_hot(cross_product_idx, num_bins)
It works, but without a hash function the crossed values may be distributed unevenly across the bins.
I could trace it to this part of the code where they simply use "X" as a string separator on one set of crossed values from various features.
I'm struggling to understand the concatenate(features) part. Do I have to do the hash of each "pair" of features?
If you are crossing two features, then for each pair of values (one from each feature) you need to "combine" them somehow, which is what they call "concatenation". The concatenation I see in the code is just string concatenation using the separator "X".
So if you have feature A: "A1", "A2" and feature B: "B1", "B2", "B3", you would need to do
hash("A1_X_B1") % num_bins
hash("A1_X_B2") % num_bins
hash("A1_X_B3") % num_bins
hash("A2_X_B1") % num_bins
hash("A2_X_B2") % num_bins
hash("A2_X_B3") % num_bins
and then one-hot encode these numbers if you want.
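The string-based cross above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper name hashed_cross is hypothetical, and zlib.crc32 is swapped in as a stable hash because Python's built-in hash() on strings is salted per process, so the bins would change between runs. It is not expected to reproduce the exact bin assignments Keras produces, since the hash function is different.

```python
import zlib
from itertools import product


def hashed_cross(values_a, values_b, num_bins, separator="_X_"):
    """Map every (a, b) pair to a bin index via a stable string hash.

    hashed_cross is a hypothetical helper for illustration only.
    """
    bins = {}
    for a, b in product(values_a, values_b):
        # "Concatenate" the two values into one string key, e.g. "A1_X_B1"
        key = f"{a}{separator}{b}"
        # crc32 is deterministic across runs, unlike built-in hash() on str
        bins[key] = zlib.crc32(key.encode("utf-8")) % num_bins
    return bins


crossed = hashed_cross(["A1", "A2"], ["B1", "B2", "B3"], num_bins=5)
for key, bin_index in crossed.items():
    print(key, bin_index)
```

Each of the six keys ends up with a bin index in the range 0 to num_bins - 1, which can then be one-hot encoded.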
Tensoring the operations
I'm going to assume your features are categorical but already encoded as numeric IDs, because if they were strings you would first need to map them to integers.
import torch

PRIME_NUM = 2_147_483_647

def feature_cross(feature_a: torch.Tensor, feature_b: torch.Tensor, num_bins: int) -> torch.Tensor:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    feature_a = feature_a.to(device)
    feature_b = feature_b.to(device)
    # Add a column dimension to A and repeat it across B's length: shape (len_a, len_b)
    a_expanded = feature_a.unsqueeze(1).expand(-1, feature_b.size(0))
    # Add a row dimension to B and repeat it across A's length: shape (len_a, len_b)
    b_expanded = feature_b.unsqueeze(0).expand(feature_a.size(0), -1)
    # Combine each (a, b) pair into a single integer, then reduce it to a bin index
    combined = a_expanded.long() * PRIME_NUM + b_expanded.long()
    hashed = combined % num_bins
    return hashed

feature_a = torch.tensor([1001, 1002, 1003, 1004], dtype=torch.long)
feature_b = torch.tensor([2001, 2002, 2003, 2004, 2005], dtype=torch.long)
num_bins = 1000
result = feature_cross(feature_a, feature_b, num_bins)
print(result)
To take an example, if A = [1,2,3] and B = [4,5], we are expanding them to
# a_expanded
1 1
2 2
3 3
# b_expanded
4 5
4 5
4 5
and combining them element-wise as a * PRIME_NUM + b (multiplication by a prime, then addition) to achieve a cross.
You're right that using tuples can also be an option for combining the values since tuples can be hashed, but I don't know of a tensorised way of creating tuples.
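Note that feature_cross crosses every value of A with every value of B, whereas the Keras example in the question crosses the inputs row by row (five inputs in, five bins out). A minimal sketch of that row-wise variant, assuming both features have the same shape, could look like this. The name feature_cross_rowwise is hypothetical, and because the combining function is not the hash Keras uses, the bin indices are not expected to match the Keras output.

```python
import torch

PRIME_NUM = 2_147_483_647  # same multiplier as in feature_cross


def feature_cross_rowwise(feature_a: torch.Tensor, feature_b: torch.Tensor, num_bins: int) -> torch.Tensor:
    """Cross row i of feature_a with row i of feature_b (no all-pairs expansion).

    feature_cross_rowwise is a hypothetical helper for illustration only.
    """
    if feature_a.shape != feature_b.shape:
        raise ValueError("both features must have the same shape")
    # Combine each aligned pair into one integer, then reduce it to a bin index
    combined = feature_a.long() * PRIME_NUM + feature_b.long()
    return combined % num_bins


feat1 = torch.tensor([1, 5, 2, 1, 4])
feat2 = torch.tensor([2, 9, 42, 37, 8])
num_bins = 5

# One bin index per row: shape (5,)
bin_index = feature_cross_rowwise(feat1, feat2, num_bins)
# One-hot encode the bin indices: shape (5, num_bins)
one_hot = torch.nn.functional.one_hot(bin_index, num_classes=num_bins)
print(bin_index)
print(one_hot)
```

The output shapes line up with the 'int' and 'one_hot' output modes shown in the question, even though the individual bin assignments differ.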