python, numpy, tensorflow, keras, pytorch

Hashed cross-product transformation in PyTorch


I want to implement a hashed cross product transformation like the one Keras uses:

>>> layer = keras.layers.HashedCrossing(num_bins=5, output_mode='one_hot')
>>> feat1 = np.array([1, 5, 2, 1, 4])
>>> feat2 = np.array([2, 9, 42, 37, 8])
>>> layer((feat1, feat2))
<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.]], dtype=float32)>
>>> layer2 = keras.layers.HashedCrossing(num_bins=5, output_mode='int')
>>> layer2((feat1, feat2))
<tf.Tensor: shape=(5,), dtype=int64, numpy=array([2, 0, 4, 0, 2])>

This layer performs crosses of categorical features using the "hashing trick". Conceptually, the transformation can be thought of as: hash(concatenate(features)) % num_bins.

I'm struggling to understand the concatenate(features) part. Do I have to do the hash of each "pair" of features?
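
Is the idea something like the following, with Python's built-in hash as a stand-in for whatever hash function Keras actually uses (note that hash is salted per process, so this would not reproduce the Keras bins)?

>>> num_bins = 5
>>> idx = [hash(str(a) + str(b)) % num_bins for a, b in zip(feat1, feat2)]  # one bin per pair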

In the meantime, I tried with this:

>>> import torch
>>> from torch import nn
>>> cross_product_idx = torch.as_tensor((feat1 * (feat2.max() + 1) + feat2) % num_bins)
>>> cross_product = nn.functional.one_hot(cross_product_idx, num_bins)

It works, but since it does not use a real hash function, the values may not be distributed evenly across the bins.


Solution

  • I could trace it to this part of the Keras code, where they simply join one set of crossed values from the various features with the string separator "_X_".

    I'm struggling to understand the concatenate(features) part. Do I have to do the hash of each "pair" of features?

    If you are crossing two features, then for each pair of values from the two features, you need to "combine" them somehow (which is what they term "concatenation"). The concatenation I see in the code is just string concatenation using the separator "_X_".

    So if you have feature A with values "A1", "A2" and feature B with values "B1", "B2", "B3", you would need to compute

    • hash("A1_X_B1") % num_bins
    • hash("A1_X_B2") % num_bins
    • hash("A1_X_B3") % num_bins
    • hash("A2_X_B1") % num_bins
    • hash("A2_X_B2") % num_bins
    • hash("A2_X_B2") % num_bins

    and then one-hot encode these numbers if you want.
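
    In plain Python, a minimal sketch of that idea could look like this (using Python's built-in hash as a stand-in; it is salted per process and is not the hash TensorFlow uses internally, so the resulting bins will not match Keras):

    num_bins = 5
    values_a = ["A1", "A2"]
    values_b = ["B1", "B2", "B3"]
    
    # Bin every crossed pair by hashing the "_X_"-joined string
    bins = {(a, b): hash(f"{a}_X_{b}") % num_bins
            for a in values_a for b in values_b}
    print(bins)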

    Tensorising the operations

    I'm going to assume your features are categorical but already numeric IDs, because if they were strings you would first need to map them to integers.

    import torch
    
    PRIME_NUM = 2_147_483_647  # large Mersenne prime (2**31 - 1) used to spread the combined values
    
    def feature_cross(feature_a: torch.Tensor, feature_b: torch.Tensor, num_bins: int) -> torch.Tensor:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        feature_a = feature_a.to(device)
        feature_b = feature_b.to(device)
        
        # Shape (len_a, len_b): each row repeats one value of feature_a
        a_expanded = feature_a.unsqueeze(1).expand(-1, feature_b.size(0))
        # Shape (len_a, len_b): each column repeats one value of feature_b
        b_expanded = feature_b.unsqueeze(0).expand(feature_a.size(0), -1)
        
        # Combine each pair into a single integer, then bucket it into num_bins
        combined = a_expanded.long() * PRIME_NUM + b_expanded.long()
        
        hashed = combined % num_bins
        
        return hashed
    
    feature_a = torch.tensor([1001, 1002, 1003, 1004], dtype=torch.long)
    feature_b = torch.tensor([2001, 2002, 2003, 2004, 2005], dtype=torch.long)
    num_bins = 1000
    
    result = feature_cross(feature_a, feature_b, num_bins)
    print(result)
    

    To take an example, if A = [1,2,3] and B = [4,5], we are expanding them to

    # a_expanded
    
    1 1
    2 2
    3 3
    
    # b_expanded
    
    4 5
    4 5
    4 5
    
    

    and combining them through addition (with prime number multiplication) to achieve a cross.
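
    As a side note, feature_cross above returns the full len(A) x len(B) cross. If you instead want the elementwise behaviour from the Keras example in the question, where the i-th value of one feature is crossed only with the i-th value of the other, a minimal sketch (reusing PRIME_NUM from above, with the same caveat that prime multiplication is only a stand-in for a real hash) could be:

    def feature_cross_elementwise(feature_a: torch.Tensor, feature_b: torch.Tensor, num_bins: int) -> torch.Tensor:
        # Combine the i-th values of both features, then bucket into num_bins
        combined = feature_a.long() * PRIME_NUM + feature_b.long()
        return combined % num_bins
    
    feat1 = torch.tensor([1, 5, 2, 1, 4])
    feat2 = torch.tensor([2, 9, 42, 37, 8])
    idx = feature_cross_elementwise(feat1, feat2, num_bins=5)   # analogous to output_mode='int'
    one_hot = torch.nn.functional.one_hot(idx, num_classes=5)   # analogous to output_mode='one_hot'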

    You're right that using tuples can also be an option for combining the values since tuples can be hashed, but I don't know of a tensorised way of creating tuples.