Search code examples
pythondictionaryhashtable

Create dict from a string or list


Background

I want to generate a hash table for a given string or given list. The hash table treat element as key and showup times as value. For instance:

s = 'ababcd'
s = ['a', 'b', 'a', 'b', 'c', 'd']
dict_I_want = {'a':2,'b':2, 'c':1, 'd':1}

My attempt

# method 1
from collections import Counter
s = 'ababcd' 
hash_table1 = Counter(s)

# method 2
s = 'ababdc' 
hash_table2 = dict()
for i in s:
    if hash_table2.get(i) == None:
        hash_table2[i] = 1
    else:
        hash_table2[i] += 1
hash_table1 == hash_table2

True

Usually, I use 2 methods above. One is from standard library, but is not allowed in some code practice sites. Another is written from scratch but I think it's too long. If I use dict comprehension, I come up with 2 additional methods:

{i:s.count(i) for i in set(s)}
{i:s.count(i) for i in s}

Question

I am wondering if there are other methods to initialize a hash table from string of list, clearly or efficiently?

Speed comparation of my 4 methods mentioned

from collections import Counter
import random,string,numpy,perfplot

def from_set(s):
    return {i:s.count(i) for i in set(s)}

def from_string(s):
    return {i:s.count(i) for i in s}

def handy(s):
    hash_table2 = dict()
    for i in s:
        if hash_table2.get(i) == None:
            hash_table2[i] = 1
        else:
            hash_table2[i] += 1
    return hash_table2

def counter(s):
    return Counter(s)

perfplot.show(
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set,from_string,handy,counter],
    labels=['set','string','handy','counter'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

Speed comparation of methods


Solution

  • I typically used Counter or defaultdict for creating frequency of occurrences.

    Surprisingly found poster's method of from_set outperforms both most of the time.

    Observations

    1. from_set (labeled 'set') performs the best overall
    2. Various dictionary methods are only better for smaller string lengths (i.e. < 100)
    3. Counter method is only better for a small range of string lengths.
    4. from_set is 2.3X faster than defaultdict and 1.5X faster than Counter for large strings

    Algorithms

    from collections import Counter
    from collections import defaultdict
    
    import random,string,numpy,perfplot
    
    def from_set(s):
        " Use builtin count function for each item in set "
        return {i:s.count(i) for i in set(s)}
    
    def counter(s):
        " Uses counter module "
        return Counter(s)
    
    def normal_dic(s):
      " Update dictionary by checking if item in it or not "
      d = {}
      for i in s:
        if i in d:
          d[i] += 1
        else:
          d[i] = 1
    
      return d
    
    def setdefault_dic(s):
      " Use setdefault to preset unknown keys "
      d = {}
      for i in s:
        d.setdefault(i, 0)
        d[i] += 1
    
      return d
    
    def default_dic(s):
        " Used defaultdict from collections module "
        d = defaultdict(int)
        for i in s:
            d[i] += 1
        return d
    
    def try_dic(s):
        " Use try/except to check if item in dictionary "
        d = {}
        for i in s:
            try:
                d[i] += 1
            except:
                d[i] = 1
    
        return d
    

    Test Code

    Uses Perfplot Module

    out = perfplot.bench(
       setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
        kernels=[from_set, counter, setdefault_dic, default_dic, try_dic],
        labels=['set', 'counter', 'setdefault', 'defaultdict', 'try_dic'],
        n_range=[2 ** k for k in range(17)],
        xlabel="len(string)",
        equality_check= None
        # More optional arguments with their default values:
        # title=None,
        # logx="auto",  # set to True or False to force scaling
        # logy="auto",
        # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
        # automatic_order=True,
        # colors=None,
        # target_time_per_measurement=1.0,
        # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
        # relative_to=1,  # plot the timings relative to one of the measurements
        # flops=lambda n: 3*n,  # FLOPS plots
        )
    out.show()
    #out.save("perf.png")
    out
    

    Charts

    Absolute Values

    from_set label 'set' in the diagram. It's easier to relative performance on the relative diagram below than this absolute diagram.

    Absolute Values

    Relative Values

    from_set label 'set' in the diagram.

    from_set method is the horizontal line. All other methods including Counter and defaultdict are above it (more time consuming) for larger values.

    Relative Values

    Table

    Actual times

           n  setdefault     try_dic  defaultdict    counter    from_set
         1.0       799.0       899.0       1299.0     6099.0     1399.0
         2.0      1099.0      1199.0       1599.0     6299.0     1699.0
         4.0      1699.0      1699.0       2199.0     6299.0     2399.0
         8.0      3199.0      3099.0       3499.0     6899.0     3699.0
        16.0      6099.0      5499.0       5899.0     7899.0     5900.0
        32.0     10899.0      9299.0       9899.0     8999.0    10299.0
        64.0     20799.0     15599.0      15999.0    11899.0    15099.0
       128.0     38499.0     25499.0      25899.0    16599.0    21899.0
       256.0     73100.0     44099.0      42700.0    26299.0    30299.0
       512.0    137999.0     77099.0      72699.0    43199.0    46699.0
      1024.0    286599.0    154500.0     144099.0    85700.0    79699.0
      2048.0    549700.0    289999.0     266799.0   157499.0   145699.0
      4096.0   1103899.0    577399.0     535499.0   309399.0   278999.0
      8192.0   2200099.0   1151500.0    1051799.0   606999.0   542499.0
     16384.0   4658199.0   2534399.0    2295300.0  1414199.0  1087799.0
     32768.0   9509200.0   5270200.0    4838000.0  3066600.0  2177200.0
     65536.0  19539500.0  10806300.0    9942100.0  6503299.0  4337599.0