Feature hashing of zip codes with scikit-learn in machine learning


I am working on a machine learning problem where my data set contains a lot of zip codes (~8k unique values). I therefore decided to hash the values into a smaller feature space instead of using something like one-hot encoding (OHE).

The problem I encountered is that only a small percentage (about 20%) of the rows in my hashed output are unique, which, as I understand it, means that I have a lot of duplicates/collisions. Even when I increased the number of features in my hash table to ~200, I never got more than 20% unique rows. This does not make sense to me, since a growing number of columns in the hash should allow more unique combinations.

I used the following code to hash my zip codes with scikit-learn and to estimate the collisions from the number of unique rows in the resulting array:

import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

D = pd.unique(Daten["PLZ"])

print("Zipcode Data:", D,"\nZipcode Shape:", D.shape)

h = FeatureHasher(n_features=2**5, input_type="string")
f = h.transform(D)
f = f.toarray()

print("Feature Array:\n",f ,"\nFeature Shape:", f.shape)

unq = np.unique(f, axis=0)

print("Unique values:\n",unq,"\nUnique Shape:",unq.shape)
print("Percentage of unique values in hash array:",unq.shape[0]/f.shape[0]*100)

This is the output I received:

Zipcode Data: ['86916' '01445' '37671' ... '82387' '83565' '83550'] 
Zipcode Shape: (8158,)
Feature Array:
 [[ 2.  1.  0. ...  0.  0.  0.]
 [ 0. -1.  0. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 ...
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 [ 0. -1.  0. ...  0.  0.  0.]] 
Feature Shape: (8158, 32)
Unique values:
 [[ 0. -3.  0. ...  0.  0.  0.]
 [ 0. -2.  0. ...  0.  0.  0.]
 [ 0. -2.  0. ...  0.  0.  0.]
 ...
 [ 4.  0.  0. ...  0.  0.  0.]
 [ 4.  0.  0. ...  0.  0.  0.]
 [ 4.  0.  0. ...  0.  0.  0.]] 
Unique Shape: (1707, 32)
Percentage of unique values in hash array: 20.9242461387595

Any help and insights are greatly appreciated.


Solution

  • That very first 2 in the transformed data should be a clue. I think you'll also find that many of the columns are all-zero.

    From the documentation,

    Each sample must be iterable...

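    So the hasher is treating the zip code '86916' as the collection of characters '8', '6', '9', '1', '6'. Since there are only ten distinct digits, you get at most ten nonzero columns (the first column presumably corresponds to the '6', which appears twice, hence the 2 noted at the beginning). You should be able to rectify this by reshaping the input to be 2-dimensional, so that each sample is an iterable containing the whole zip code as a single string.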
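The difference described above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the `zips` array is a hypothetical stand-in for `pd.unique(Daten["PLZ"])`, and the per-character behavior is reproduced explicitly with `list(z)`, since passing bare strings as samples may raise an error in some scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical sample values standing in for pd.unique(Daten["PLZ"]).
zips = np.array(["86916", "01445", "37671", "82387", "83565", "83550"])

hasher = FeatureHasher(n_features=2**5, input_type="string")

# What effectively happened in the question: each zip code is split into
# its characters, so only the ten digits '0'..'9' are ever hashed.
per_char = hasher.transform([list(z) for z in zips]).toarray()

# The fix: wrap each zip code in its own list, so that every sample is an
# iterable holding the whole zip code as one string token.
per_zip = hasher.transform([[z] for z in zips]).toarray()

# Columns that are nonzero in at least one row.
print("Nonzero columns (per character):", np.count_nonzero(per_char.any(axis=0)))
# Each row now has a single nonzero entry (+1 or -1) for its zip code.
print("Nonzero entries per row (per zip):", (per_zip != 0).sum(axis=1))
```

With a NumPy array, `D.reshape(-1, 1)` should give the same kind of 2-dimensional input as the list comprehension. Note that with a single token per sample, each row has exactly one nonzero entry of +1 or -1, so there can be at most `2 * n_features` distinct rows; for roughly 8k zip codes, `n_features` would need to be considerably larger than 32 to keep collisions low.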