Search code examples
hashhash-collision

How to generate a collision with a hash-string-to-int method?


I'm in the process of integrating a hash method (farmhash) to our software base. The hashing services seem to work appropriately. Basically, it turns a string of characters into an unique-ish integer value.

I've added an infrastructure to detect collisions (in a case where two input strings would result in the same output integer). Basically, for each string that is hashed, I keep the [hash result] -> [string] in a map, and every time a new string is hashed, I compare it to what's in the map; if the hash is already there, I make sure that it is the same string that has generated it. I am aware that it's potentially slow and it's potentially memory consuming, but I'm performing theses checks only on a "per request" basis: they are not enabled in release mode.

Now I'd like to test that infrastructure (as in get a collision, from a unit test point of view).

I could generate a bunch of strings (random or sequential), spam my hash infrastructure and hope to see a positive collision but I feel I'll waste my time, CPU cycles and fill the memory with a load of data without success.

How would one go about generating collisions?

Not-so-relevant-facts:

  • I'm using c++;
  • I can generate data using python;
  • The target int is uint32_t.

Update:

I have created a small naive program to brute force the detection of collision:

void
addToQueue(std::string&& aString)
{
  //std::cout << aString << std::endl;
  hashAndCheck( aString ); // Performs the hash and check if there is a collision
  if ( mCount % 1000000 )
    std::cout << "Did " << mCount << " checks so far" << std::endl;
  mQueue.emplace( aString );
}


void 
generateNextRound( const std::string& aBase )
{
  //48 a 122 incl
  for ( int i = 48; i <= 122; i++ )
  {
    addToQueue( std::move( std::string( aBase ).append( 1, static_cast<char>( i ) ) ) );
  }
}


int main( void )
{

  // These two generate a collision
  //StringId id2 = HASH_SID( "@EF" ); // Hashes only, does not check
  //StringId id1 = HASH_SID( "7\\:" ); // Hashes only, does not check

  std::string base = "";
  addToQueue( std::move( base ) );

  while ( true )
  {
    const std::string val = mQueue.front();
    mQueue.pop();
    generateNextRound( val );
  }

  return 0;
}

I could eventually have added threading and stuff in there but I didn't need it because I found a collision in about 1 second (in debug mode).


Solution

  • If you brute force search for collisions offline, you could hard code strings that cause collisions into your test so that your test is as close to production code as possible, but doesn't suffer the performance penalty of doing the brute force work each time (or, like other people have said, you can make an intentionally junky hash algorithm that causes excessive collisions)