Tags: python, string, python-2.7, hash, unique

Cannot get unique IDs for strings in Python 2.7


I am trying to generate unique IDs for a list of words. I want these IDs to be globally unique and stable: if the same word appears in another list, it should get the same ID. For example, if the ID for "density" is 151111911, it should be 151111911 wherever "density" occurs.

As you can see from the output below, my current approach using id and intern is not working: the ID for rrb is exactly the same as the ID for lrb.

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

# mydefaultdict and mydouble are helpers defined elsewhere (not shown in this post)
featureVector = mydefaultdict(mydouble)

for featureID, featureVal in enumerate(featureList):
    print "featureID is", featureID
    print "featureVal is ", featureVal
    print "Encoded feature value is", id(intern(str(featureVal.encode("utf-8"))))
    featureVector[featureID] = featureVal


featureID is 0
featureVal is  guinea
Encoded feature value is 4569583120.0
featureID is 1
featureVal is  bissau
Encoded feature value is 4569581632.0
featureID is 2
featureVal is  compared
Encoded feature value is 4569583120.0
featureID is 3
featureVal is  countriesthe
Encoded feature value is 4567944360.0
featureID is 4
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 5
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 6
featureVal is  guinea
Encoded feature value is 4569581632.0
featureID is 7
featureVal is  bissau
Encoded feature value is 4569583120.0
featureID is 8
featureVal is  similar
Encoded feature value is 4496118144.0
featureID is 9
featureVal is  iran
Encoded feature value is 4569583120.0
featureID is 10
featureVal is  afghanistan
Encoded feature value is 4569581632.0
featureID is 11
featureVal is  cameroon
Encoded feature value is 4569583120.0
featureID is 12
featureVal is  panama
Encoded feature value is 4569581632.0
featureID is 13
featureVal is  montenegro
Encoded feature value is 4569583120.0
featureID is 14
featureVal is  guinea
Encoded feature value is 4569581632.0
featureID is 15
featureVal is  belarus
Encoded feature value is 4569583120.0
featureID is 16
featureVal is  palau
Encoded feature value is 4569581632.0
featureID is 17
featureVal is  location_slot
Encoded feature value is 4567944360.0
featureID is 18
featureVal is  south
Encoded feature value is 4569583120.0
featureID is 19
featureVal is  africa
Encoded feature value is 4569581632.0
featureID is 20
featureVal is  respective
Encoded feature value is 4569583120.0
featureID is 21
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 22
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 23
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 24
featureVal is  capita
Encoded feature value is 4569581632.0
featureID is 25
featureVal is  per
Encoded feature value is 4455914152.0
featureID is 26
featureVal is  square
Encoded feature value is 4347127296.0
featureID is 27
featureVal is  kilometer
Encoded feature value is 4569581632.0
featureID is 28
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 29
featureVal is  global
Encoded feature value is 4346597072.0
featureID is 30
featureVal is  rank
Encoded feature value is 4346629984.0
featureID is 31
featureVal is  number_slot
Encoded feature value is 4569583120.0
featureID is 32
featureVal is  years
Encoded feature value is 4569581632.0
featureID is 33
featureVal is  growthguinea
Encoded feature value is 4567944360.0
featureID is 34
featureVal is  bissau
Encoded feature value is 4569583120.0
featureID is 35
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 36
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 37
featureVal is  positive
Encoded feature value is 4514096160.0
featureID is 38
featureVal is  growth
Encoded feature value is 4569583120.0
featureID is 39
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 40
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 41
featureVal is  last
Encoded feature value is 4346568112.0
featureID is 42
featureVal is  years
Encoded feature value is 4569583120.0
featureID is 43
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 44
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 45
featureVal is  LOCATION_SLOT~-appos+LOCATION~-prep_of
Encoded feature value is 4538026784.0
featureID is 46
featureVal is  LOCATION~-prep_of+that~-prep_to
Encoded feature value is 6043251168.0
featureID is 47
featureVal is  that~-prep_to+similar~prep_with
Encoded feature value is 6043251168.0
featureID is 48
featureVal is  similar~prep_with+density~prep_of
Encoded feature value is 6043251168.0
featureID is 49
featureVal is  density~prep_of+NUMBER~appos
Encoded feature value is 6043251168.0
featureID is 50
featureVal is  NUMBER~appos+NUMBER~amod
Encoded feature value is 6043247024.0
featureID is 51
featureVal is  NUMBER~amod+NUMBER_SLOT
Encoded feature value is 6043247024.0

What am I doing wrong? I need to convert these words into numbers because the sentence above will be fed to a classifier that requires numerical (vectorized) features.


Solution

  • From the docs for intern():

    Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.

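    Your loop keeps no reference to the interned string, so it can be garbage-collected before the next string is interned. The new string may then be allocated at the same address and get the same id (in CPython, id() is the object's memory address). To prevent this, keep the references in a container. I'll use a dict: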

    featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']
    
    # dict of id -> interned string; holding the reference keeps each string alive
    seen = {}
    
    for featureID,featureVal in enumerate(featureList):
        print "featureID is",featureID
        print "featureVal is ",featureVal
        interned = intern(str(featureVal.encode("utf-8")))
        interned_id = id(interned)
    
        # ensure that no other string with the same id has been seen
        assert interned_id not in seen or seen[interned_id] == interned
    
        # storing the interned string keeps it alive, so its id cannot be reused;
        # change this to seen[interned_id] = None and you'll (probably) get
        # AssertionError from the line above
        seen[interned_id] = interned
    
        print "Encoded feature value is", interned_id
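
Note that even with the references kept, id() values are only unique among objects that are alive at the same time, and they change from run to run, so they cannot give "density" the same number in every list. One alternative, sketched here on the assumption that a hash-derived number is acceptable to your classifier, is to compute the ID from the word's bytes with hashlib. The helper name stable_id and the 48-bit truncation are my own choices for illustration, not something from the question, and the input is assumed to be unicode text:

```python
import hashlib


def stable_id(word, num_bytes=6):
    """Return an integer that depends only on the characters of `word`.

    Hypothetical helper: `word` is assumed to be unicode text, and
    `num_bytes` (an illustrative parameter) sets how much of the digest is kept.
    """
    # Hash the UTF-8 bytes so the result does not depend on object identity.
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    # Each byte is two hex digits; keep the leading ones as an integer.
    return int(digest[:2 * num_bytes], 16)


words = [u"density", u"lrb", u"rrb", u"density"]
ids = [stable_id(w) for w in words]

for w, i in zip(words, ids):
    print("%s -> %d" % (w, i))
```

Truncating to 48 bits keeps every ID below 2**53, so it survives conversion to a float unchanged. Collisions between different words are possible in principle, but very unlikely for a vocabulary of this size.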