Search code examples
c++compressionlossless-compressionlzw

What does the parameter assigned to an unordered_map<> name hold?


I was going through a great article on LZW compression algorithm by Mark Nelson, and found something in the code I haven't yet encountered.

In the code, he used unordered_map to store strings and their corresponding frequency. The declaration of the map was:

std::unordered_map<std::string, unsigned int> codes( (max_code * 11)/10 );

max_code stores the maximum number of entries in the map, i.e. 32767. The code:

void compress( INPUT &input, OUTPUT &output, const unsigned int max_code = 32767 )
{
    //code
}

I am unaware as to what parameter does an unsigned int value associated with codes hold. Also, could someone enlighten me as to why the max_code value is multiplied by 11 and then divided by 10?

Here is the compress function for reference:

template<class INPUT, class OUTPUT>
void compress( INPUT &input, OUTPUT &output, const unsigned int max_code = 32767 )
{
  input_symbol_stream<INPUT> in( input );
  output_code_stream<OUTPUT> out( output, max_code );

  std::unordered_map<std::string, unsigned int> codes( (max_code * 11)/10 );
  for ( unsigned int i = 0 ; i < 256 ; i++ )
    codes[std::string(1,i)] = i;
  unsigned int next_code = 257;
  std::string current_string;
  char c;
  while ( in >> c ) {
    current_string = current_string + c;
    if ( codes.find(current_string) == codes.end() ) {
      if ( next_code <= max_code )
        codes[ current_string ] = next_code++;
      current_string.erase(current_string.size()-1);
      out << codes[current_string];
      current_string = c;
    }
  }
  if ( current_string.size() )
    out << codes[current_string];
}

Solution

  • The std::unordered_map type template has a number of constructors that accept a size_type; typically std::size_t, which is an unsigned integer. This value is related to the semantics, or rather the typical implementation, of an unordered map as a hash-set. From the link above:

    explicit unordered_map(size_type bucket_count,
                           const Hash& hash = Hash(),
                           const key_equal& equal = key_equal(),
                           const Allocator& alloc = Allocator() );
    
    1. Constructs empty container. Sets max_load_factor() to 1.0. For the default constructor, the number of buckets is implementation-defined.

    ...

    bucket_count - minimal number of buckets to use on initialization. If it is not specified, implementation-defined default value is used

    It serves as an indicator of how much data you expect the container to hold. This of course will not be perfect because of hash collisions. But it serves as a hint for the container for how much memory to allocate, similar to what std::vector::reserve() would do.

    My guess is that by choosing slightly more than you actually need, here 1.1 as much, you are more likely to avoid reallocations on the bucket-level. 1.1 sounds a bit too conservative to me, I might have gone for 1.5 or higher. But possibly experiments were made to determine what safety-factor to use.