Search code examples
javascriptarrayslookupstring-hashing

Hashing array of strings in javascript


Just wondering if there is some other way than this.

var hashStringArray = function(array) {
    array.sort();
    return array.join('|');
};

I don't like sorting much and using that delimiter is not safe either if it's contained in one of the strings. In overall I need to produce same hash no matter the order of strings. It will be rather short arrays (up to 10 items), but it will be required very often so it shouldn't be too slow.

I intend to use it with ES6 Map object and I need to easily find same array collection.

Updated example of use

var theMap = new Map();
var lookup = function(arr) {
    var item = null;
    var hashed = hashStringArray(arr);
    if (item = theMap.get( hashed )) {
        return item;
    }
    theMap.set( hashed, itemBasedOnInput );
    return itemBasedOnInput;
}

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];

lookup(arr1) === lookup(arr2)

Performance tests

http://jsperf.com/hashing-array-of-strings/5


Solution

  • Two things occurred to me as the basis of a solution:

    1. summing doesn't depend on order, which is actually a flaw in simple checksums (they don't catch changes in block order within a word), and

    2. we can convert strings to summable numbers using their charcodes

    Here's a function to do (2) :

    charsum = function(s) {
      var i, sum = 0;
      for (i = 0; i < s.length; i++) {
        sum += (s.charCodeAt(i) * (i+1));
      }
      return sum
    }
    

    Here's a version of (1) that computes an array hash by summing the charsum values:

    array_hash = function(a) {
      var i, sum = 0
      for (i = 0; i < a.length; i++) {
        var cs = charsum(a[i])
        sum = sum + (65027 / cs)
      }
      return ("" + sum).slice(0,16)
    }
    

    Fiddle here: http://jsfiddle.net/WS9dC/11/

    If we did a straight sum of the charsum values, then the array ["a", "d"] would have the same hash as the array ["b", "c"] - leading to undesired collisions. So based on using non-UTF strings, where charcodes go up to 255, and allowing for 255 characters in each string, then the max return value of charsum is 255 * 255 = 65025. So I picked the next prime number up, 65027, and used (65027 / cs) to compute the hash. I am not 100% convinced this removes collisions... perhaps more thought needed... but it certainly fixes the [a, d] versus [b, c] case. Testing:

    var arr1 = ['alpha','beta','gama'];
    var arr2 = ['beta','alpha','gama'];
    
    console.log(array_hash(arr1))
    console.log(array_hash(arr2))
    console.log(array_hash(arr1) == array_hash(arr2))
    

    Outputs:

    443.5322979371356 
    443.5322979371356
    true 
    

    And testing a case that shows different hashes:

    var arr3 = ['a', 'd'];
    var arr4 = ['b', 'c'];
    
    console.log(array_hash(arr3))
    console.log(array_hash(arr4))
    console.log(array_hash(arr3) == array_hash(arr4))
    

    outputs:

    1320.651443298969
    1320.3792001649144
    false 
    

    Edit:

    Here's a revised version, which ignore duplicates from the arrays as it goes, and return the hash based on unique items only:

    http://jsfiddle.net/WS9dC/7/

    array_hash = function(a) {
      var i, sum = 0, product = 1
      for (i = 0; i < a.length; i++) {
        var cs = charsum(a[i])
        if (product % cs > 0) {
          product = product * cs
          sum = sum + (65027 / cs)  
        }
      }
      return ("" + sum).slice(0, 16)
    }
    

    testing:

    var arr1 = ['alpha', 'beta', 'gama', 'delta', 'theta', 'alpha', 'gama'];
    var arr2 = ["beta", "gama", "alpha", "theta", "delta", "beta"];
    
    console.log(array_hash(arr1))
    console.log(array_hash(arr2))
    console.log(array_hash(arr1) === array_hash(arr2))
    

    returns:

    689.878503111701
    689.878503111701
    true 
    

    Edit

    I've revised the answer above to account for arrays of words that have the same letters. We need these to return different hashes, which they now do:

    var arr1 = ['alpha', 'beta']
    var arr2 = ['alhpa', 'ateb'] 
    

    The fix was to add a multiplier to the charsum func based on the char index:

    sum += (s.charCodeAt(i) * (i+1));